Prosper is a peer-to-peer online lending marketplace, a Silicon Valley endeavor to disrupt the current industry of personal loans provided by institutions. Prosper was the first company in the United States to provide an online marketplace for everyday investors to offer loans to everyday borrowers. Institutions (e.g. banks and credit unions) have traditionally been the avenue for loans because they can mitigate the costs and distribute the risk involved in lending capital. In the following analysis I use anonymized data on Prosper’s users to explore borrowers’ ‘creditworthiness’ in order to describe investors’ ‘risk’ and ‘reward’ for Prosper loans. I would like to explore the efficacy of Prosper’s borrower metrics to determine whether they give individuals a viable means of investing their money in strangers with reasonable confidence in the outcome. If it can be shown that Prosper’s lending marketplace connects investors and borrowers in a mutually beneficial manner for the large majority of transactions, this potentially describes a path to disrupting the traditional marketplace.
For anyone who has attempted to borrow money, and more importantly for anyone lending money (specifically to strangers), one piece of information is necessary for the transaction to proceed, assuming the lender wishes to be paid back: the ‘creditworthiness’ of the borrower. This key ingredient of the business of lending capital has taken various forms across time and place. The current iteration, at least here in the United States, is the credit score: a number that strives to embody an individual’s ‘creditworthiness’, an attempt to quantify, and subsequently discount, the risk involved in lending money to another person. Given that the credit score is such an important metric for quantifying the risk involved in lending money to a stranger, let’s briefly explore in general terms how these scores are arrived at. A creditor usually looks at three factors known as the “three Cs”: capacity, capital, and character.
Source: How Lenders Rate Credit Worthiness
Traditionally, credit bureaus (also known as credit reporting agencies) collect various information related to the three Cs above: information about an individual’s borrowing habits, bill-paying habits (on time or late, and how consistently) and previous loans. It can be shown that this information is a useful indicator of future loan outcomes. All the information collected by credit reporting agencies is designed to reduce the “effect of asymmetric information between borrowers and lenders, to alleviate problems of adverse selection and moral hazard” (Wikipedia). In other words, by reviewing the fiscal habits of someone we have never met, we can assess with some measure of accuracy the risk involved in lending that person money. The real question is how much accuracy and, in the interest of determining the viability of Prosper’s business plan, whether it is enough for potential investors to put their money on the line.
Typically, credit bureaus collect personal information related to an individual’s creditworthiness by sourcing it from data furnishers, e.g. creditors, lenders, utilities, collection agencies, and courts. This information is then made available to those performing risk assessment through a credit background search. The Prosper data, anonymized but made available through an API, provides us with an opportunity to explore some of the data behind the ‘science’ of quantifying borrowers’ ‘creditworthiness’.
To begin, I install several packages and load the R libraries required for the rest of this analysis. I then review the whole data-set and simplify, combine, and reduce many of the variables before venturing into the actual plotting and analysis. After finishing the preparation of the data, I subdivide it into different groups to represent a ‘range’ of Prosper users. Next I graph the remaining variables in uni-variate plots faceted on my groups. After reviewing those graphs, I narrow my analysis down to just a few key variables that I hope capture the essence of my question. Then I move on to my multi-variate charts and final plots, which I follow with a description of the outcomes and my general thoughts on how the data reviewed thus far relates to my goal. Next I create a quick and dirty supervised classification model to predict whether a loan will be successful or not, and briefly review which variables contributed the most to this model. I finish with an overall summary and some reflections and conclusions on the analysis.
# Load all of the packages that I end up using
# in my analysis.
## The following packages were necessary for my analysis.
## Uncomment to install them for the first time.
#install.packages("rmarkdown")
#install.packages("maps")
#install.packages("mapdata")
#install.packages("tidyr")
#install.packages("mosaic")
#install.packages("dunn.test")
#install.packages("polycor")
#install.packages("hexbin")
#install.packages("randomForest")
#install.packages("caret")
## I found the following libraries helpful in my analysis
library(ggplot2) # for plotting
library(GGally) # for upgraded pairs function
library(scales)
library(gridExtra) # for plotting multiple plots
library(maps) # for the choropleth map
library(mapdata) # for the choropleth map
library(dplyr) # for upgraded dataframe manipulation
library(tidyr) # for upgraded dataframe manipulation
library(mosaic)
library(FSA) # for dunnTest (Dunn's non-parametric multiple comparisons)
library(RColorBrewer) # for plotting color scales
library(polycor) # for the hetcor correlation analysis
library(grid) # for plotting
library(caret) # for partitioning dataframe into training and test set
library(randomForest) # for constructing supervised classification model to
# measure importance of features
The data used for this exploratory analysis was collected from the peer-to-peer online lending platform Prosper. This data is provided to the public through both a RESTful API and as a large snapshot download in csv or XML file format. Recently Prosper has changed its public access to the snapshot data dump, and will only release snapshot data for 45 days after the close of each quarter. https://www.prosper.com/tools/DataExport.aspx
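## Load the Prosper data from the local directory
# (stringsAsFactors=TRUE reproduces the Factor columns shown below on R >= 4.0)
ProsperData_full <- read.csv("prosperLoanData.csv", header=TRUE,
stringsAsFactors=TRUE)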
The snapshot data set used here was last updated 03/11/2014; it contains 113,937 loan listings with 81 variables on each listing. Further details on all of the variables can be found here. In brief, the data contains a large collection of variables associated with borrowers requesting loans, for example data on borrowers’ credit, income, intended use, previous and current lines of credit and prior Prosper loans. Additionally, Prosper’s proprietary variables are provided; they attempt to measure the risk and profitability of potential borrowers, not unlike a credit score.
## First look at the structure of the data-set
## Number of observations: 113937
## Number of features: 81
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
For several of these variables the reporting methods changed after 2009 (the specific variables can be found by reading the variable descriptions, but they are generally related to Prosper’s proprietary variables), so I filtered out all listings created before 2009-01-01. Dates and times were converted to a cleaner, easier format to work with using strptime.
# Convert ListingCreationDate using strptime, then filter with a comparison
# operator, year >= 2009 (POSIXlt stores the year as years since 1900, so 109)
filteredPost09 <- function(ProsperData_full) {
#
# input: Prosper dataframe containing all listings.
#
# output: Prosper dataframe containing only those listings dated
# on or after 2009.
#
# convert dates using strptime
ProsperData_full$ListingCreationDate = strptime(
ProsperData_full$ListingCreationDate,"%Y-%m-%d")
# Subset by dates on or after 2009
ProsperData_Post_09 = subset(ProsperData_full,
ProsperData_full$ListingCreationDate$year >= 109)
ProsperData_Post_09$ListingCreationDate = as.Date(
ProsperData_Post_09$ListingCreationDate, "%Y-%m-%d")
return(ProsperData_Post_09)
}
ProsperData_Post_09 <- filteredPost09(ProsperData_full)
cat(paste("Number of loan listings: ",dim(ProsperData_Post_09)[1], sep=''))
Number of loan listings: 84881
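The `109` in the filter above comes from the way POSIXlt stores years. As a minimal sketch, assuming a few made-up timestamps in the same layout as ListingCreationDate (the values below are illustrative, not taken from the data), the offset and an equivalent Date-based comparison look like this:

```r
# Toy listing timestamps standing in for ListingCreationDate (hypothetical values)
listing_dates <- c("2008-12-31 23:59:59.000000000",
                   "2009-01-01 00:00:00.000000000",
                   "2014-02-27 08:28:07.900000000")

# strptime() returns POSIXlt; its $year field counts years since 1900,
# so 2009 is stored as 109. Trailing time characters are ignored here.
parsed <- strptime(listing_dates, "%Y-%m-%d", tz = "UTC")
parsed$year

# The same filter expressed on Date objects, which avoids the 1900 offset
as_dates <- as.Date(parsed)
keep <- as_dates >= as.Date("2009-01-01")
as_dates[keep]
```

Comparing Date objects directly may be easier to read than the `$year >= 109` form, and it should select the same rows.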
Looking over the variables, I decided that the time-of-day detail in the remaining date-time variables did not add value for the questions I was interested in exploring. I therefore converted them to ISO 8601 compliant dates (year-month-day only).
## Convert the remaining date-time columns to Date where time detail is not
# necessary. Use as.Date on columns: FirstRecordedCreditLine,
# DateCreditPulled, LoanOriginationDate
convertDates <- function(ProsperData_Post_09) {
#
# Input: Prosper dataframe containing only those listings dated
# on or after 2009
#
# output: Prosper dataframe where all Date and Times have been converted
# into year-month-day format.
#
ProsperData_Post_09$FirstRecordedCreditLine <- as.Date(
strptime(ProsperData_Post_09$FirstRecordedCreditLine, "%Y-%m-%d"))
ProsperData_Post_09$DateCreditPulled <- as.Date(
strptime(ProsperData_Post_09$DateCreditPulled, "%Y-%m-%d"))
ProsperData_Post_09$LoanOriginationDate <- as.Date(
strptime(ProsperData_Post_09$LoanOriginationDate, "%Y-%m-%d"))
return(ProsperData_Post_09)
}
ProsperData_Post_09 = convertDates(ProsperData_Post_09)
In an effort to focus attention on a smaller number of variables, I removed those that did not provide value I was interested in, seemed redundant, or were summarized in other features. I also removed the unique key values for the members, the loans themselves, and the groups, preferring instead to use the loan listing key as my ‘primary key’.
## Drop the following variables:
# -ListingNumber (use ListingKey instead),
# -BorrowerRate (this value is incorporated in the BorrowerAPR,
# which I believe is more representative of actual costs),
# -LenderYield (This value is incorporated in the EstimatedReturn),
# -EstimatedEffectiveYield (This value is incorporated in the
# EstimatedReturn),
# -EstimatedLoss (This value is incorporated in the EstimatedReturn),
# -ProsperRating..Alpha. (This value is duplicated in
# ProsperRating..numeric.),
# -ProsperScore (This value is used to determine the Prosper Rating,
# therefore its information is captured in that variable),
# -Occupation (Far too many categories to graph effectively without
# attempting to aggregate into broader disciplines, which is made
# difficult by the ambiguity of user inputs),
# -EmploymentStatus (This variable is captured in
# EmploymentStatusDuration and annual income),
# -OpenRevolvingAccounts (I will use OpenRevolvingMonthlyPayment as a
# proxy for this variable),
# -TotalInquiries (More focused on inquiries in the last 6 months, not a
# person's lifetime),
# -PublicRecordsLast10Years (Again more focused on public
# records more recent than 10 years),
# -TotalCreditLinespast7years (A quick and dirty overview of this data
# suggests a Poisson distribution. Given the large number of variables
# I will drop this variable and simply rely on current credit lines to
# capture the key information in this variable),
# -OpenCreditLines (I will rely on a combination of current credit lines
# and information on credit load, monthly credit payments, and debt ratio
# to explain the important essence of this variable),
# -AvailableBankcardCredit (The share of credit still available is implied
# by BankcardUtilization (1 - BankcardUtilization)),
# -RevolvingCreditBalance (Given the large number of variables I will
# place emphasis on the percentage of available credit used rather
# than the total amount. Note: this variable is not listed in the
# subset() call below, so it remains in the data),
# -TotalTrades (More interested in the trade lines opened in the last 6
# months rather than over the lifetime of the borrower),
# -ScorexChangeAtTimeOfListing (For the sake of reducing the number of
# variables I will drop this variable. It is interesting but will be
# sacrificed in pursuit of narrowing my focus),
# -LoanFirstDefaultedCycleNumber (Information is already captured by the
# LoanStatus variable. The other value it may have is in finding patterns
# in how long loans last before default. This is not a direction I wish to
# take in this analysis),
# -LoanOriginationDate (I believe that the most important value this variable
# adds is in determining the age of the loan, and LoanMonthsSinceOrigination
# already captures this, unless of course we are interested in any lurking
# effects associated with the loan origination date),
# -LoanOriginationQuarter (Any trends gleaned from this would be suspect
# for lurking variables associated with the context of the time; this is not
# a direction I wish to take with this analysis),
# -MemberKey (Already have unique listing key, so I will forego this key),
# -ClosedDate (This information is available in less detail in
# LoanStatus. The additional value it adds, such as the average length of
# failed loans, is interesting but again not a direction I wish to take in
# this analysis),
# -LoanMonthsSinceOrigination (The current age of the loans is not a variable
# I will consider when reviewing the risks and rewards of Prosper loans. I
# believe this variable would be more valuable in a longitudinal study with
# other variables),
# -LoanKey (Already have unique listing key, so I will forego this key),
# -LoanNumber (Already have unique listing key, so I will forego this key),
# -LP_CustomerPayments (This variable is made up of
# LP_CustomerPrincipalPayments and LP_InterestandFees. I intend to analyze
# just the interest or value gained from the loan, so I will drop this
# variable),
# -CurrentlyInGroup (Not really sure what value this adds, so I will avoid it),
# -GroupKey (Not interested in groups),
# -PercentFunded (Not sure if this is just a symptom of when the data was
# collected or represents borrowers who are difficult to fund. Given that
# over 99% of loans are fully funded I will drop this variable),
# -CreditGrade (Dropped in the subset() call below; it holds the rating
# used for pre-2009 listings, which were filtered out above)
ProsperData_Post_09_Subset <- subset(ProsperData_Post_09,
select = -c(ListingNumber, BorrowerRate, LenderYield,
EstimatedEffectiveYield, EstimatedLoss, ProsperRating..Alpha.,
ProsperScore, Occupation, EmploymentStatus, OpenRevolvingAccounts,
TotalInquiries, PublicRecordsLast10Years, AvailableBankcardCredit,
TotalTrades, ScorexChangeAtTimeOfListing, ClosedDate,
LoanFirstDefaultedCycleNumber, LoanOriginationDate,
LoanOriginationQuarter, MemberKey, LoanKey, CreditGrade,
LoanNumber, LP_CustomerPayments, CurrentlyInGroup, GroupKey,
PercentFunded, TotalCreditLinespast7years, OpenCreditLines,
LoanMonthsSinceOrigination) )
cat(paste("Number of features (variables): ",
dim(ProsperData_Post_09_Subset)[2],sep=''))
Number of features (variables): 51
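Since ListingKey is kept as the ‘primary key’, it may be worth checking how unique it actually is: the str() output above reports 113,066 ListingKey levels for 113,937 rows, so some keys repeat in the full snapshot. The following is a minimal sketch of such a check on a made-up data frame (the keys and statuses are hypothetical, and dropping repeats is shown only as one possible treatment, not something done in this analysis):

```r
# Toy listings with one repeated key (hypothetical values)
listings <- data.frame(
  ListingKey = c("A1", "B2", "B2", "C3"),
  LoanStatus = c("Current", "Completed", "Completed", "Chargedoff"),
  stringsAsFactors = FALSE
)

# How many rows share a key with another row?
n_rows <- nrow(listings)
n_keys <- length(unique(listings$ListingKey))
n_rows - n_keys

# Show every row involved in a repeated key, scanning from both directions
is_repeated <- duplicated(listings$ListingKey) |
  duplicated(listings$ListingKey, fromLast = TRUE)
dup_rows <- listings[is_repeated, ]

# One possible treatment: keep only the first row for each key
deduped <- listings[!duplicated(listings$ListingKey), ]
```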
After looking at the remaining data, I checked all the variables for values that were ‘0’ or ‘NA’. Reviewing the results, I removed a few variables on the basis that most of their values (well over 90%) were ‘NA’ or ‘0’.
## Review number of zero entries as a ratio for each variable
zerosRatio <- function(ProsperData_Post_09_Subset, plot=TRUE,
listValues = FALSE) {
#
# Input: Prosper Data frame, 'plot' variable (TRUE | FALSE) to toggle plot
# on/off (default == TRUE), 'listValue' variable (TRUE | FALSE) to toggle
# on/off list of variables with their respective percent zero values
# (default == FALSE)
#
# Output: Barplot of percent zero ratios for all variables in dataframe,
# optional list display of those variables with text representations of
# their percent zero ratios
#
# Count the number of non-zero values in each column; NAs are skipped here,
# so they are counted together with the zeros in percent_zero below
non_zero <- colSums(ProsperData_Post_09_Subset != 0.0, na.rm = TRUE)
percent_zero <- ((dim(ProsperData_Post_09_Subset)[1]-non_zero)/
dim(ProsperData_Post_09_Subset)[1]) *100
df = data.frame(percent_zero)
variables = row.names(df)
percent_zero = df$percent_zero
# for ease of sorting and graphing, created dataframe with 'variables'
# as a column and not row names.
df = data.frame(variables, percent_zero)
# rearrange rows, sorting by percent_zero ratio values
df_sorted = arrange(df, desc(percent_zero))
if (listValues == TRUE )
{
print (df_sorted)
}
else {
print ("sorting variables based on ratio of zero values")
}
# order the variables to match the percent_zero order for plotting
df_sorted$variables <- reorder(df_sorted$variables,
df_sorted$percent_zero)
if (plot == TRUE) {
# Note: plot x-axis is a little crowded
ggplot(data = df_sorted, aes(x=variables, y = percent_zero)) +
scale_y_continuous(breaks = seq(0,100,10),
labels = function(x){return(paste(x,"%",sep=''))}) +
geom_bar(stat='identity', colour=('white'), fill=('red')) +
ylab("Percent of zero values present in variable") +
xlab("Variables") +
theme(axis.text.x=element_text(angle = 75, hjust = 1))
}
else {
print ("Plot disabled")
}
}
zerosRatio(ProsperData_Post_09_Subset)
## [1] "sorting variables based on ratio of zero values"
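To make explicit what the percentage in this plot measures, here is a minimal sketch on a made-up data frame (the column names are borrowed from the Prosper data, the values are hypothetical). It computes the share of entries in each column that are either zero or NA, which is what zerosRatio reports because NAs are excluded from its non-zero count:

```r
# Toy data frame with a mix of zeros, NAs and real values (hypothetical)
toy <- data.frame(
  Recommendations  = c(0, 0, 0, 1),
  AmountDelinquent = c(0, NA, 250, 0),
  BorrowerAPR      = c(0.12, 0.25, 0.31, 0.15)
)

# Percentage of entries per column that are zero or missing
percent_zero_or_na <- sapply(toy, function(column) {
  100 * mean(is.na(column) | column == 0)
})

# Largest shares first, mirroring the ordering used for the bar plot
sort(percent_zero_or_na, decreasing = TRUE)
```

A column such as AmountDelinquent therefore scores high whether its entries are genuinely zero or simply unreported, which is worth keeping in mind when reading the plot.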
After running a quick analysis on the number of zero values for each of the variables (features) a few variables jumped out. These variables contain mostly zero values (above 90%) and may have value to potential Prosper investors if only to highlight what you are unlikely to find in borrowers. The following variables had very high percentages of no data:
With the exception of the social variables (recommendations, investment from friends count and investment from friends amount), the data, and the lack thereof, suggest that not many Prosper loans are defaulted on (again, this is just a cross-sectional snapshot of the data for one quarter; I am working on the assumption that it holds some predictive value for the average Prosper quarter). For the most part it appears that the methods employed by Prosper to represent ‘creditworthiness’ allow lenders to make mostly valid assessments of a borrower’s risk.
There are still far too many variables at this point for brevity, and so little information is reported on the more socially focused variables that I will not incorporate them in the rest of the analysis. It would be very interesting if these variables contained more data, as that would suggest that traditional lending paradigms could be adjusted to incorporate a social element in the risk analysis.
## Drop PublicRecordsLast12Months, Recommendations,
# InvestmentFromFriendsCount, InvestmentFromFriendsAmount
ProsperData_Post_09_Subset = subset(ProsperData_Post_09_Subset,
select = -c(PublicRecordsLast12Months,Recommendations,
InvestmentFromFriendsCount, InvestmentFromFriendsAmount))
cat(paste("Number of loan listings: ",dim(ProsperData_Post_09_Subset)[1]
,"\n","Number of features (variables): ",
dim(ProsperData_Post_09_Subset)[2], sep=''))
Number of loan listings: 84881
Number of features (variables): 47
I will generate several broad groups of loan listings, which I will examine together with the other remaining variables. These groups are as follows: current loans, successful loans, unsuccessful loans and past due loans. Successful loans are those that are completed or on their final payment; unsuccessful loans are those that have been cancelled, charged off or defaulted on. I’ll subdivide these groups into first-time Prosper users and prior Prosper users to see if any valuable information can be gleaned by looking at borrowers who were able to get another loan from Prosper lenders. The key motivation behind the groups is to shrink the loan status factor so that visual analysis will be cleaner, with 4 levels rather than the 12 levels that currently exist in the LoanStatus variable. As a result I will lose information on the past due loans, which are currently broken down into different lengths of lateness, and on the different methods of loan failure. In the later multivariate analysis I will set aside the current and past due loans and look at just those loans that have completed or failed, as a ground-truth representation of the patterns and trends in successful and failed loan listings.
## Create variables describing Prosper Users: Prosper_User_Status, Loan_Outcome
generateGroups <- function(ProsperDataFrame){
#
# Input: Prosper DataFrame
#
# Output: A Prosper DataFrame with two new variables:
# Prosper_User_Status and Loan_Outcome.
# Prosper User Status: First Time Prosper User, Previous Prosper User.
# Loan Outcomes: Current, Successful, Past Due, Unsuccessful
#
ProsperDataFrame = mutate(ProsperDataFrame,
Prosper_User_Status = ifelse(!is.na(TotalProsperLoans),
"Previous Prosper User","First Time Prosper User"))
ProsperDataFrame = mutate(ProsperDataFrame,
Loan_Outcome = ifelse((LoanStatus == "Chargedoff" |
LoanStatus == "Cancelled" | LoanStatus == "Defaulted"),
"Unsuccessful Loan", ifelse((LoanStatus == "Completed" |
LoanStatus == "FinalPaymentInProgress"),"Successful Loan",
ifelse(LoanStatus == "Current", "Current Loan",
"Past Due Loan")))) # every remaining status is a "Past Due" level
return(ProsperDataFrame)
}
ProsperData_Post_09_Subset = generateGroups(ProsperData_Post_09_Subset)
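The nested ifelse() calls in generateGroups can be hard to follow, so here is a minimal base-R sketch of the same four-way collapse using `%in%`. The helper name `collapse_status` and the sample statuses are hypothetical and used only for illustration; the mapping itself follows the grouping described above.

```r
# Collapse detailed loan statuses into the four broad outcome groups
collapse_status <- function(status) {
  # Default: anything not matched below is treated as past due
  outcome <- rep("Past Due Loan", length(status))
  outcome[status %in% c("Chargedoff", "Cancelled", "Defaulted")] <-
    "Unsuccessful Loan"
  outcome[status %in% c("Completed", "FinalPaymentInProgress")] <-
    "Successful Loan"
  outcome[status == "Current"] <- "Current Loan"
  factor(outcome, levels = c("Current Loan", "Successful Loan",
                             "Past Due Loan", "Unsuccessful Loan"))
}

# Hypothetical sample of statuses
status <- c("Current", "Completed", "Chargedoff",
            "Past Due (1-15 days)", "Defaulted", "FinalPaymentInProgress")
table(collapse_status(status))
```

As in the original function, the "Past Due Loan" group is a catch-all, so any unexpected status string would also land there.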
## Create several new variables, rename several old variables,
# refactor incorrectly imported variables, convert difftime to
# numeric for ease of use and finally drop variables used for combining
# into new variables that will no longer be a part of the analysis.
mutateData <- function(ProsperDataSet) {
#
# Input: Prosper dataset containing a subset of Prosper variables
#
# Output: A dataframe of Prosper data containing several new variables:
# 'credit_range', 'payment_to_monthly_income', 'credit_history_years',
# and 'lender_return'. These variables are created by
# transforming and/or removing the following variables:
# CreditScoreRangeLower, CreditScoreRangeUpper, StatedMonthlyIncome,
# MonthlyLoanPayment, LP_InterestandFees, LP_NonPrincipalRecoverypayments,
# FirstRecordedCreditLine, DateCreditPulled, LP_CollectionFees,
# LP_ServiceFees, LP_NetPrincipalLoss.
#
# Rename variables where prudent and finally convert the following integer
# variables to factors: ProsperRating and ListingCategory.
#
## Combine credit lower and upper into a
# credit range and convert to factor. Convert date credit pulled and first
# recorded credit line into a variable representing a borrower's credit
# history in years. Use monthly loan payment and monthly income to create a
# variable representing the portion of a borrower's income going towards
# loan payments (set to 1.01 when the stated monthly income is 0, to avoid
# dividing by zero). Finally, combine all the interest and fees collected
# by lenders with the lost principal, collection fees and loan
# servicing fees (if any exist) to create a variable representing the
# return, positive or negative, on a lender's investment. The fee columns
# are added rather than subtracted because they appear to be recorded as
# negative values (LP_ServiceFees is negative in the str() output above).
df_mutate = ProsperDataSet %>%
mutate(credit_range = as.factor(
paste(CreditScoreRangeLower ,CreditScoreRangeUpper, sep="-")),
credit_history_years = round((
DateCreditPulled-FirstRecordedCreditLine)/365,0),
payment_to_monthly_income = ifelse(StatedMonthlyIncome != 0,
round((MonthlyLoanPayment / (StatedMonthlyIncome)),2), 1.01),
lender_return = round((((LP_InterestandFees +
LP_NonPrincipalRecoverypayments) +
(LP_CollectionFees+LP_ServiceFees-LP_NetPrincipalLoss)) /
LoanOriginalAmount), 6))
# convert difftime to numeric (easier variable to work with)
df_mutate$credit_history_years = as.numeric(
df_mutate$credit_history_years)
## created to explore my lender_return variable and double check its
# results to make sure they are reasonable (note: <<- assigns to the
# global environment as a side effect)
df_furtherStudy <<- df_mutate
# Drop those variables no longer necessary for calculations
df_drop = subset(df_mutate, select = -c(CreditScoreRangeLower,
CreditScoreRangeUpper, DelinquenciesLast7Years, StatedMonthlyIncome,
DateCreditPulled, FirstRecordedCreditLine, LP_CollectionFees,
LP_ServiceFees, OnTimeProsperPayments, LP_NetPrincipalLoss,
LP_GrossPrincipalLoss, LP_NonPrincipalRecoverypayments,
LP_CustomerPrincipalPayments,LP_InterestandFees,LoanStatus) )
# Rename variables containing unnecessary information
df_rename = rename(df_drop,
ListingDate = ListingCreationDate,
ProsperRating = ProsperRating..numeric. ,
ListingCategory = ListingCategory..numeric.,
EmploymentDurationMonths = EmploymentStatusDuration)
# Convert ProsperRating, ListingCategory,Prosper_User_Status and
# Loan_Outcomes to factors
df_rename$ProsperRating = as.factor(df_rename$ProsperRating )
df_rename$ListingCategory = as.factor(df_rename$ListingCategory)
df_rename$Prosper_User_Status = as.factor(df_rename$Prosper_User_Status)
df_rename$Loan_Outcome = as.factor(df_rename$Loan_Outcome)
return(df_rename)
}
Prosper_Post09_Subset <- mutateData(ProsperData_Post_09_Subset)
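To illustrate how the lender_return variable behaves, here is a minimal sketch of the same arithmetic wrapped in a hypothetical helper (the function name and argument names are mine, not part of the analysis). The first call uses the rounded values shown for the first row of the str() output; the second uses made-up numbers for a charged-off loan.

```r
# Net gain to lenders as a fraction of the original loan amount.
# Fees are passed as stored in the data, i.e. as negative numbers.
lender_return <- function(interest_and_fees, recovery_payments,
                          collection_fees, service_fees,
                          net_principal_loss, original_amount) {
  net_gain <- interest_and_fees + recovery_payments +
    collection_fees + service_fees - net_principal_loss
  round(net_gain / original_amount, 6)
}

# Paid-off loan: 1971 in interest, 133.2 in service fees, 9425 borrowed
paid <- lender_return(1971, 0, 0, -133.2, 0, 9425)
paid   # about 0.195, i.e. roughly a 19.5% gain over the life of the loan

# Hypothetical charged-off loan: most of the principal is lost
lost <- lender_return(400, 50, -30, -20, 2500, 5000)
lost   # -0.42
```

Note that this is a total return over the life of the loan rather than an annualized figure, so loans with different terms are not directly comparable on this measure alone.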
## Briefly explore my lender return variable to double check that it was constructed correctly and makes logical sense.
temp_df = (df_furtherStudy[,c("LoanOriginalAmount","LP_InterestandFees",
"LP_ServiceFees","LP_CollectionFees","LP_NetPrincipalLoss",
"LP_NonPrincipalRecoverypayments")])
df_sums = (df_furtherStudy$LP_InterestandFees +
df_furtherStudy$LP_NonPrincipalRecoverypayments) +
(df_furtherStudy$LP_ServiceFees +
df_furtherStudy$LP_CollectionFees -
df_furtherStudy$LP_NetPrincipalLoss)
by(data = df_furtherStudy,c(df_furtherStudy$Loan_Outcome),
function(x) summary(((x$LP_InterestandFees +
x$LP_NonPrincipalRecoverypayments) +
(x$LP_ServiceFees+x$LP_CollectionFees -
x$LP_NetPrincipalLoss))/
x$LoanOriginalAmount))
temp_df1 <- df_furtherStudy %>%
filter(Prosper_User_Status == "Previous Prosper User",
Loan_Outcome == "Successful Loan")
temp_df1 <- mutate(temp_df1, sign = lender_return >= 0.0)
Using ggplot, I graph most of the variables remaining at this point. I facet on my groups; technically this introduces another variable, so the plots are not truly uni-variate (although one of the facets will contain all the data). By faceting I hope to highlight the variables with interesting explanatory power as they relate to my groups. I will not continue exploring many of these variables past this point in my analysis, so I wish to briefly explore their relevance to my groups and ultimately how they may influence a borrower’s ‘creditworthiness’.
I apply a pair-wise Mann Whitney U test (sometimes I use a Dunn test (Kruskal-Wallis multiple comparisons)) to each of my variables to analytically determine significant differences between my groups. I use these tests to make as few assumptions as possible about the distributions and because many of my variables are categorical. The use of the analytical tests is to back up my assessments of the graphs several of whose distributions look very similar across my facets. In addition I was looking to add some statistical rigor to my assessments. I generate a couple of ad-hoc functions to perform graph specific transformations and alteration in order to make more sensible graphs for several of the variables but these few alterations to the dataset are temporary and graph specific.
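As a minimal sketch of this testing approach, the block below runs a Kruskal-Wallis test and Holm-adjusted pairwise Mann-Whitney U (Wilcoxon rank sum) tests on invented data. The data frame toy, its columns, and the three group names are assumptions for illustration only; the Dunn test needs an add-on package and is not shown.

```r
set.seed(42)

# Invented data: a skewed numeric variable for three hypothetical loan groups
toy <- data.frame(
  outcome = factor(rep(c("Current", "Successful", "Unsuccessful"), each = 40)),
  apr = c(rexp(40, rate = 8), rexp(40, rate = 6), rexp(40, rate = 3))
)

# Omnibus rank-based test across all groups
kw <- kruskal.test(apr ~ outcome, data = toy)
print(kw$p.value)

# Pairwise Mann-Whitney U tests with Holm adjustment of the p-values
pw <- pairwise.wilcox.test(toy$apr, toy$outcome, p.adjust.method = "holm")

# Lower-triangular matrix of adjusted p-values, one cell per pair of groups
print(pw$p.value)
```

The printed matrix has the same layout as the "Pairwise comparisons using Wilcoxon rank sum test" tables in the rest of this section.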
## Reorder or re-level factor variables for more sensible graph labeling
# referenced: http://www.cookbook-r.com/Manipulating_data/Changing_the_order_of_levels_of_a_factor/
# -- Changing the order of levels of a factor
reorderFactorVariables <- function(DataFrame) {
#
# Input: Prosper DataFrame
#
# Output: Prosper DataFrame whose factor variables have been reordered or
# properly leveled for more logical plots.
#
DataFrame$IncomeRange <- factor(DataFrame$IncomeRange,
levels=c("Not Employed", "$0","$1-24,999","$25,000-49,999",
"$50,000-74,999","$75,000-99,999","$100,000+"))
DataFrame$Loan_Outcome <- factor(DataFrame$Loan_Outcome,
levels = c("Current Loan","Successful Loan", "Past Due Loan",
"Unsuccessful Loan"))
return(DataFrame)
}
Prosper_Post09_Subset = reorderFactorVariables(Prosper_Post09_Subset)
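The effect of re-leveling can be seen in a small base R sketch. The vector income below is invented; the point is that explicit levels fix the plotting order, and that any value missing from the levels argument (including a spelling or capitalization mismatch) silently becomes NA, which is worth checking after a call like the one above.

```r
# Invented values; without explicit levels the order is alphabetical, not numeric
income <- factor(c("$25,000-49,999", "$100,000+", "$1-24,999", "$0"))
print(levels(income))

# Explicit levels give the order used on plot axes and in facets
income_ordered <- factor(income,
                         levels = c("$0", "$1-24,999",
                                    "$25,000-49,999", "$100,000+"))
print(levels(income_ordered))

# Caveat: a value that is not listed in `levels` is converted to NA
income_bad <- factor(income,
                     levels = c("$0", "$1-24,999", "$25,000-49,999"))
print(sum(is.na(income_bad)))
```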
Final review of data-set structure
Number of loan listings: 84881
Number of features: 38
## 'data.frame': 84881 obs. of 38 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7193 6669 6686 6689 6699 6706 6687 6687 6712 6731 ...
## $ ListingDate : Date, format: "2014-02-27" "2012-10-22" ...
## $ Term : int 36 36 36 60 36 36 36 36 60 36 ...
## $ BorrowerAPR : num 0.12 0.125 0.246 0.154 0.31 ...
## $ EstimatedReturn : num 0.0547 0.06 0.0907 0.0708 0.1107 ...
## $ ProsperRating : Factor w/ 7 levels "1","2","3","4",..: 6 6 3 5 2 4 7 7 4 5 ...
## $ ListingCategory : Factor w/ 20 levels "0","1","2","3",..: 3 16 3 2 2 3 7 7 2 2 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 12 25 34 18 6 16 16 22 3 ...
## $ EmploymentDurationMonths : int 44 113 44 82 172 103 269 269 300 1 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 1 2 2 2 1 1 2 2 1 1 ...
## $ CurrentCreditLines : int 14 5 19 21 10 6 17 17 2 9 ...
## $ OpenRevolvingMonthlyPayment : num 389 115 220 1410 214 101 219 219 25 290 ...
## $ InquiriesLast6Months : int 3 0 1 0 0 3 1 1 1 1 ...
## $ CurrentDelinquencies : int 0 4 0 0 0 0 0 0 1 0 ...
## $ AmountDelinquent : num 0 10056 0 0 0 ...
## $ RevolvingCreditBalance : num 3989 1444 6193 62999 5812 ...
## $ BankcardUtilization : num 0.21 0.04 0.81 0.39 0.72 0.13 0.11 0.11 0.51 0.7 ...
## $ TradesNeverDelinquent..percentage. : num 1 0.76 0.95 1 0.68 0.8 1 1 0.72 1 ...
## $ TradesOpenedLast6Months : num 2 0 2 0 0 0 1 1 0 0 ...
## $ DebtToIncomeRatio : num 0.18 0.15 0.26 0.36 0.27 0.24 0.25 0.25 0.12 0.18 ...
## $ IncomeRange : Factor w/ 7 levels "Not Employed",..: 5 4 7 7 4 4 4 4 6 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ TotalProsperLoans : int NA NA 1 NA NA NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA 11 NA NA NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA 0 NA NA NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA 0 NA NA NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA 11000 NA NA NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA 9948 NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 3 ...
## $ LoanOriginalAmount : int 10000 10000 15000 15000 3000 10000 10000 10000 13500 4000 ...
## $ MonthlyLoanPayment : num 319 321 564 342 123 ...
## $ Investors : int 1 158 20 1 1 1 1 1 19 1 ...
## $ Prosper_User_Status : Factor w/ 2 levels "First Time Prosper User",..: 1 1 2 1 1 1 1 1 1 1 ...
## $ Loan_Outcome : Factor w/ 4 levels "Current Loan",..: 1 1 1 1 1 1 1 1 1 3 ...
## $ credit_range : Factor w/ 15 levels "600-619","620-639",..: 5 11 5 8 5 6 12 12 3 5 ...
## $ credit_history_years : num 18 30 10 41 13 14 21 21 22 17 ...
## $ payment_to_monthly_income : num 0.05 0.11 0.06 0.04 0.06 0.11 0.09 0.09 0.05 0.06 ...
## $ lender_return : num 0 0.0944 0.0798 0.0201 0.1998 ...
Univariate Plot Functions
The following plots are a first look at the remaining variables, meant to highlight any interesting trends, if they exist, as they relate to my facet groups and goal.
Helper Functions
## Function to return a sample subset of a dataframe for faster computation and
## iteration times in the initial construction of graphs. Will not be used for
## final plots and analysis
sampleDataFrame <- function(DataFrame, samplesize) {
#
# Input: DataFrame, samplesize
#
# Output: DataFrame containing a sample of rows equal to the provided
# samplesize in the parameter
#
return(sample_n(DataFrame, size = samplesize))
}
## Convert the DataFrame's BorrowerState abbreviations to full names for the choropleth map
stateMapPrep <- function(DataFrame) {
#
# Input: Prosper DataFrame
#
# Output: New Prosper DataFrame with 'state' (BorrowerState converted
# from abbreviation to lowercase full state name) and 'Freq', the
# proportion of listings from each state in the input data set.
#
# convert each BorrowerState abbreviation to its full name
DataFrame$BorrowerState = tolower(state.name[match(
DataFrame$BorrowerState, state.abb)])
# using table and prop.table create a new dataframe containing the states
# and their respective shares of the listings in the input DataFrame
DataFrame_StateFreq = data.frame(with(DataFrame,
prop.table(table(BorrowerState))))
DataFrame_final = rename(DataFrame_StateFreq, state = BorrowerState)
return(DataFrame_final[!is.na(DataFrame_final$state),])
}
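The lookup at the heart of stateMapPrep can be sketched with base R alone. The abbreviations in abbr are invented; "DC" and the empty string are included on purpose, because the built-in state.abb and state.name vectors cover only the 50 states, so anything else maps to NA.

```r
# Invented abbreviations; "DC" and "" have no entry in state.abb
abbr <- c("CA", "NY", "CA", "DC", "TX", "")

# match() returns NA for unknown abbreviations, so the full name is NA too
full <- tolower(state.name[match(abbr, state.abb)])
print(full)

# Share of listings per state; table() drops NA values by default,
# so the shares are relative to the matched states only
freq <- as.data.frame(prop.table(table(state = full)))
print(freq)
```

Because unmatched rows are dropped before the shares are computed, 'Freq' sums to 1 over the states that were matched rather than over all listings.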
## Plot a map of the United States where color represents frequency of borrowers
## References used explicitly for the choropleth map can be found here:
# http://blog.revolutionanalytics.com/2009/10/geographic-maps-in-r.html,
# https://uchicagoconsulting.wordpress.com/tag/r-ggplot2-maps-visualization/,
# https://trinkerrstuff.wordpress.com/2013/07/05/ggplot2-chloropleth-of-supreme-court-decisions-an-tutorial/
choropleth_map <- function(ProsperDataFrame){
#
# Input: Prosper Data Frame containing a variable with state abbreviations
#
# Output: choropleth map based on the number of those state abbreviations
# seen in the dataframe.
#
# using the maps package, pull state data: longitude and latitude
states_map <- map_data("state")
# lowercase the names so they match the output of stateMapPrep()
states = data.frame(state.name = tolower(state.name))
# rename states variables for merging
states <- select(states, state = state.name)
# apply map preparation function and use shorthand name to reference it
map_df <- stateMapPrep(ProsperDataFrame)
# Merge data set so that all states are represented
map_ready_df <- merge(states, map_df, by="state", all=TRUE)
# Replace NA's for states without any borrowers with zeros values.
map_ready_df[is.na(map_ready_df)] <- 0
plot_state <- ggplot(map_ready_df, aes(map_id = state) ) +
geom_map(aes(fill = Freq), map = states_map, colour = "Black") +
expand_limits(x = states_map$long, y = states_map$lat) +
scale_fill_gradient(low="white", high="red") +
theme(legend.position = "bottom",
axis.ticks = element_blank(),
axis.title = element_blank(),
axis.text = element_blank()) +
guides(fill = guide_colorbar(barwidth = 9, barheight = .7)) +
ggtitle("Choropleth map of Loan Listings by Locations")
return(plot_state)
}
## I found the following function, which I use to plot my sqrt transformation
# and achieve more visually valuable breaks in my charts, here:
# https://groups.google.com/forum/#!topic/ggplot2/IUje5H0jwm4. Credit
# goes to Brian S. Diggs, PhD,
# Senior Research Associate, Department of Surgery,
# Oregon Health & Science University
mysqrt_trans <- function() {
trans_new("mysqrt",
transform = base::sqrt,
inverse = function(x) ifelse(x<0, 0, x^2),
domain = c(0, Inf))
}
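The behaviour of the inverse in mysqrt_trans can be checked with base R alone. In the sketch below, fwd and inv simply copy the two functions passed to trans_new above, and counts is an invented vector. My reading, which is an assumption rather than something stated in the referenced thread, is that the clamp exists because padded axis limits can fall slightly below zero on the square-root scale.

```r
# Forward and inverse functions copied from mysqrt_trans() above
fwd <- base::sqrt
inv <- function(x) ifelse(x < 0, 0, x^2)

# Invented values: the round trip recovers the original data
counts <- c(0, 1, 25, 400)
print(inv(fwd(counts)))

# A position slightly below zero on the sqrt scale maps back to 0 ...
print(inv(-0.3))
# ... whereas plain squaring would map it to a small positive number
print((-0.3)^2)
```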
[1] "Borrower Residences by State"
The choropleth map (of borrower frequency) appears to fall victim to the "yet another population map" syndrome. There could be more to this map than just a representation of population, but I will not explore the state location of loan listings any further in this analysis.
[1] "Loan Term Duration"
The plot suggests differences between my groups in loan term durations. There appear to be fewer past due payments on 12-month loans (for both new and returning borrowers), and an increased number of past due loanees with 60-month loans. The most common Prosper loan term is 36 months, although it does appear that previous Prosper borrowers are more likely to apply for 12-month loans. As the most popular loan length is 36 months, it is not too surprising that most successful loans and most unsuccessful loans are 36 months. There is a relatively high success rate for 12-month loans. The decrease in successful loans at 60 months could simply be a result of their timeframes and the number of them concluding in this quarter. I would recommend being very careful about drawing strong inferences from this variable.
[1] "Listing Date"
While it is clear that differences exist between my groups, it is not clear that they are anything more than the result of time. Shifts in distribution make sense intuitively if we recall that my groups themselves are often time dependent. No doubt useful information exists here, but I am not prepared to explore whether the loan origination date is a valid indicator of a borrower’s ‘creditworthiness’, especially not with snapshot data. This information would be more interesting in a study of Prosper loan listings over a much longer time frame.
[1] "Borrower APR"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$BorrowerAPR and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan <2e-16 - -
Past Due Loan <2e-16 <2e-16 -
Unsuccessful Loan <2e-16 <2e-16 <2e-16
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$BorrowerAPR and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
Some interesting trends occur here. On average, previous borrowers are not subject to as high an APR. Why is that? A change in policy? It also appears that borrowers who are past due or unsuccessful in paying back their loans are subject to higher interest on average. This likely indicates that Prosper’s method of measuring risk (and subsequently discounting it) has some efficacy. It is interesting to note the number of borrowers with an APR as high as ~35% across all loan outcome groups. This very high APR of ~35% applies only to first time borrowers. Clearly we would expect riskier borrowers to pay higher interest rates, but why the uneven distribution and heavy skew toward one very high APR?
[1] "Lenders Estimated Return Rate"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$EstimatedReturn and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan <2e-16 - -
Past Due Loan <2e-16 <2e-16 -
Unsuccessful Loan <2e-16 <2e-16 <2e-16
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$EstimatedReturn and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User 1.1e-12
P value adjustment method: holm
There is a trend of higher estimated returns associated with loanees who are past due or who have failed on their loans (i.e. higher risk). It is interesting to note that the very high borrower APR of ~35% for a large percentage of borrowers is not reflected (at least not yet) with as much consistency here in lenders’ estimated return rates. Additionally, it does appear that for previous borrowers there is a slight shift toward lower return rates across all loan groups. Are previous borrowers’ estimated return rates more accurate given Prosper’s prior knowledge of their performance on previous Prosper loans?
[1] "Prosper Proprietary Rating (Numeric)"
There is a clear trend among first time borrowers of lower ratings for past due and unsuccessful loans (again this is good news for Prosper’s rating system). Additionally, it appears that for successful loans previous borrowers have higher ratings. With actual data on previous Prosper users’ loan performance, Prosper’s scoring system looks to be more accurate in ranking riskiness. This might indicate that Prosper’s rating system is a slightly better estimator for returning users than for first time users.
[1] "Reasons for Loan"
Dunn (1964) Kruskal-Wallis multiple comparison
p-values adjusted with the Holm method.
Comparison Z P.unadj P.adj
1 Current Loan - Successful Loan -43.707253 0.000000e+00 0.000000e+00
2 Current Loan - Past Due Loan -12.734994 3.778771e-37 1.511508e-36
3 Successful Loan - Past Due Loan 3.249019 1.158039e-03 1.158039e-03
4 Current Loan - Unsuccessful Loan -31.734005 5.278189e-221 2.639095e-220
5 Successful Loan - Unsuccessful Loan -4.154483 3.260239e-05 6.520479e-05
6 Past Due Loan - Unsuccessful Loan -5.329860 9.828875e-08 2.948663e-07
Overall it looks like most loans are requested for debt consolidation. Some of the clearest differences are between unsuccessful and successful loans. Proportionally, there are more unsuccessful loans for business, green loans, household expenses and large purchases. Digging into this variable further would be interesting, but it is probably not a direction I will take in this analysis.
[1] "Length of Employment"
Comparison Z P.unadj P.adj
1 Current Loan - Successful Loan 20.049898 2.022589e-89 1.213553e-88
2 Current Loan - Past Due Loan 5.576379 2.455761e-08 9.823043e-08
3 Successful Loan - Past Due Loan -1.748811 8.032365e-02 8.032365e-02
4 Current Loan - Unsuccessful Loan 15.464675 6.007529e-54 3.003764e-53
5 Successful Loan - Unsuccessful Loan 2.738871 6.165050e-03 1.233010e-02
6 Past Due Loan - Unsuccessful Loan 3.155184 1.603971e-03 4.811912e-03
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$EmploymentDurationMonths and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User 3.2e-14
P value adjustment method: holm
After running an analytical test with the error rate adjusted using the Holm method, there appear to be statistically significant differences between the distributions of some of the loan outcomes and between the Prosper user statuses. Current Prosper borrowers have a longer work history on average than previous borrowers. The differences are not visually extreme, but this variable may hold some explanatory power in understanding borrowers’ ‘creditworthiness’. We do see a higher concentration of no work history in past due and unsuccessful loans; however, we also see high success rates for borrowers with just a little work history.
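The Holm adjustment itself is easy to reproduce. As a sketch, the block below feeds the six unadjusted p-values from the Length of Employment table above into base R's p.adjust and compares the result with a by-hand version. The variable names are my own, and the by-hand lines are only meant to show the mechanics.

```r
# Unadjusted p-values copied from the Length of Employment table above
p_unadj <- c(2.022589e-89, 2.455761e-08, 8.032365e-02,
             6.007529e-54, 6.165050e-03, 1.603971e-03)

# Holm adjustment as implemented in base R
p_holm <- p.adjust(p_unadj, method = "holm")

# By hand: multiply the smallest p-value by m, the next by m - 1, and so on,
# then enforce monotonicity with cummax() and cap the result at 1
m <- length(p_unadj)
ord <- order(p_unadj)
by_hand <- numeric(m)
by_hand[ord] <- pmin(1, cummax((m - seq_len(m) + 1) * p_unadj[ord]))

print(signif(p_holm, 7))
print(isTRUE(all.equal(p_holm, by_hand)))
```

Up to rounding in the printed table, the adjusted values agree with the P.adj column above. The largest p-value is multiplied by 1, which is why the Successful Loan versus Past Due Loan comparison is unchanged by the adjustment.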
[1] "Is Borrower a Homeowner"
There is a trend toward more homeowners among successful loans; the one exception is previous Prosper borrowers who are past due. This variable appears to have explanatory power as it relates to my groups and question. While not a huge indicator, owning a home does suggest slightly less risk in a borrower.
[1] "Number of Credit Lines"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$CurrentCreditLines and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan < 2e-16 - -
Past Due Loan < 2e-16 0.68 -
Unsuccessful Loan < 2e-16 < 2e-16 5.2e-13
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$CurrentCreditLines and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
For first time borrowers, riskier (unsuccessful or past due) loanees tend to have fewer credit lines on average. The difference is statistically significant and the distributions likely differ, but in this graph it shows up as a small difference. The difference between first time and previous borrowers is interesting, suggesting slightly more homogeneity in previous borrowers’ credit line distributions across loan outcomes.
[1] "Monthly Payments on Revolving Accounts"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$OpenRevolvingMonthlyPayment and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan < 2e-16 - -
Past Due Loan < 2e-16 0.0076 -
Unsuccessful Loan < 2e-16 < 2e-16 1.7e-12
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$OpenRevolvingMonthlyPayment and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
The data show a trend whereby successful borrowers make slightly larger monthly payments on their revolving credit accounts than unsuccessful borrowers. Past due borrowers may actually spend the same or more than successful borrowers; this is strange, as past due loanees very often mimic unsuccessful loanees in other variables. This suggests that how much you pay monthly on open revolving accounts has some explanatory power as to whether or not you will be a successful Prosper borrower. The difference between first time and previous borrowers is more subtle: it does appear that previous Prosper borrowers who are past due or unsuccessful tend to spend more on monthly revolving payments than their first time counterparts. Overall, previous users spend less monthly than first time users.
[1] "Inquiries in the Last 6 Months"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$InquiriesLast6Months and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan < 2e-16 - -
Past Due Loan < 2e-16 0.012 -
Unsuccessful Loan < 2e-16 < 2e-16 3.6e-07
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$InquiriesLast6Months and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
There appear to be more inquiries into borrowers whose loans are currently past due or unsuccessful. The differences between previous and first time borrowers could be a result of the credit inquiry Prosper made for the borrower’s first Prosper loan. I would have guessed that there would be more delineation between the groups in this variable.
[1] "Number of Currently Delinquent Accounts"
Dunn (1964) Kruskal-Wallis multiple comparison
p-values adjusted with the Holm method.
Comparison Z P.unadj P.adj
1 Current Loan - Successful Loan -1.27060124 2.038705e-01 4.077411e-01
2 Current Loan - Past Due Loan -10.71272890 8.870771e-27 3.548308e-26
3 Successful Loan - Past Due Loan -9.92751797 3.160239e-23 9.480716e-23
4 Current Loan - Unsuccessful Loan -18.18945752 6.255625e-74 3.753375e-73
5 Successful Loan - Unsuccessful Loan -15.97660406 1.860071e-57 9.300357e-57
6 Past Due Loan - Unsuccessful Loan -0.03703119 9.704601e-01 9.704601e-01
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$CurrentDelinquencies and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
While some differences do exist between my groups of borrowers and groups of loan outcomes, they do not appear visually to be very large. I will consider looking into this variable further as it relates to other variables. There are more delinquent accounts for borrowers who are past due on their Prosper loans, which is not too surprising. I would argue that what is more surprising is the difference between successful and unsuccessful loans as it relates to prior delinquencies; before seeing this plot I would have guessed the difference to be greater.
[1] "Amount Delinquent at time of credit pull"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$AmountDelinquent and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan 0.76 - -
Past Due Loan <2e-16 <2e-16 -
Unsuccessful Loan <2e-16 <2e-16 0.86
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$AmountDelinquent and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
Visual differences are slight. Looking at the scales reveals that most borrowers did not have any amount delinquent at the time their credit was pulled. As a result, I will not consider this variable a very strong indicator of borrowers’ ‘creditworthiness’.
[1] "Revolving Credit Balance"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$RevolvingCreditBalance and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan <2e-16 - -
Past Due Loan <2e-16 0.32 -
Unsuccessful Loan <2e-16 <2e-16 7e-07
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$RevolvingCreditBalance and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
Past due and unsuccessful loans have a higher percentage of borrowers with access to little or no revolving credit on average. For successful loans, first time and previous borrowers’ access to revolving credit looks to be almost the same, but previous borrowers who are late on their loans tend to have access to less revolving credit than their first time counterparts. For unsuccessful loans, previous borrowers appear to have more access to revolving credit than first time borrowers. I wonder what is going on here?
[1] "Percentage of Bank Card Utilization"
Dunn (1964) Kruskal-Wallis multiple comparison
p-values adjusted with the Holm method.
Comparison Z P.unadj P.adj
1 Current Loan - Successful Loan 26.317426 1.211702e-152 7.270211e-152
2 Current Loan - Past Due Loan 1.257696 2.085019e-01 2.085019e-01
3 Successful Loan - Past Due Loan -8.168072 3.133566e-16 1.253427e-15
4 Current Loan - Unsuccessful Loan 13.862090 1.074983e-43 5.374917e-43
5 Successful Loan - Unsuccessful Loan -2.315647 2.057758e-02 4.115515e-02
6 Past Due Loan - Unsuccessful Loan 6.134866 8.523108e-10 2.556932e-09
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$BankcardUtilization and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User 2.3e-07
P value adjustment method: holm
Interestingly, a large number of Prosper users do not have any credit utilized, at least when their credit was pulled. Additionally, it looks like borrowers who are unsuccessful in their loans have higher percentages of no bankcard utilization. For past due borrowers we see higher percentages, on average, of nearly 100% bankcard utilization. Between first time and previous borrowers, it looks like previous borrowers have more instances of 100% bankcard utilization on average across the loan outcomes.
[1] "Percent of never Delinquent Trade Lines (credit)"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$TradesNeverDelinquent..percentage. and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan 8.0e-12 - -
Past Due Loan < 2e-16 6.2e-16 -
Unsuccessful Loan < 2e-16 < 2e-16 0.33
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$TradesNeverDelinquent..percentage. and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
The distributions look nearly the same across all groups. Although the analytical tests suggest differences between groups, other variables have clearer visual differences. It does look to be the case that unsuccessful borrowers have a lower percentage of trades never delinquent.
[1] "Number of Trade lines opened in the Last 6 Months"
Dunn (1964) Kruskal-Wallis multiple comparison
p-values adjusted with the Holm method.
Comparison Z P.unadj P.adj
1 Current Loan - Successful Loan 1.825535 6.792033e-02 1.358407e-01
2 Current Loan - Past Due Loan -5.081708 3.740560e-07 1.122168e-06
3 Successful Loan - Past Due Loan -5.575335 2.470534e-08 9.882135e-08
4 Current Loan - Unsuccessful Loan -10.623358 2.320596e-26 1.160298e-25
5 Successful Loan - Unsuccessful Loan -10.798795 3.487515e-27 2.092509e-26
6 Past Due Loan - Unsuccessful Loan -1.060514 2.889110e-01 2.889110e-01
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$TradesOpenedLast6Months and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
The groups are visually very similar, although the analytical test suggests statistically significant differences between some of my loan groups. I am not sure there is as much to learn from this variable as there may be from others.
[1] "Debt to Income Ratio of Borrowers"
Comparison Z P.unadj P.adj
1 Current Loan - Successful Loan 35.1790650 4.179548e-271 2.507729e-270
2 Current Loan - Past Due Loan -0.2295163 8.184677e-01 8.184677e-01
3 Successful Loan - Past Due Loan -12.7085870 5.298313e-37 2.119325e-36
4 Current Loan - Unsuccessful Loan 3.1122072 1.856941e-03 5.570824e-03
5 Successful Loan - Unsuccessful Loan -16.6465368 3.206279e-62 1.603139e-61
6 Past Due Loan - Unsuccessful Loan 1.8612050 6.271523e-02 1.254305e-01
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$DebtToIncomeRatio and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
There appear to be statistically significant differences between first time and previous borrowers, although the visual differences are minute. On average, successful loans tend to have borrowers with a lower debt to income ratio than past due and unsuccessful loans. It is equally true, however, that the most common debt to income ratio for successful borrowers is approximately the same as for past due and unsuccessful borrowers.
[1] "Income Range of Borrowers"
There are noticeable differences between loan outcomes, with indications that successful loanees tend to have higher incomes than past due and unsuccessful loanees. I found one visually noticeable difference between first time and previous borrowers: previous borrowers who are past due are more likely to be split between low and high income ranges, with fewer in between. Also interesting is the slightly higher income range on average for previous borrowers on unsuccessful loans.
[1] "Verifiable Income Status"
While little visual difference exists between first time and previous borrowers (a very slight increase in non-verifiable incomes), there is an increase in non-verifiable incomes as we go down from successful to past due and finally to unsuccessful loans. I am a little surprised that any borrowers with unverifiable income are given loans. I would consider this very risky; hopefully Prosper’s rating system agrees.
#sum(Prosper_Post09_Subset$IncomeVerifiable == "False")
summary(Prosper_Post09_Subset$ProsperRating[
Prosper_Post09_Subset$IncomeVerifiable == "False"])
1 2 3 4 5 6 7 NA's
1200 1183 1452 1469 959 849 221 1
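To read the counts above as shares, a quick sketch with prop.table helps. The counts are copied from the summary output (leaving out the single NA); the variable names are my own, and I am assuming, as the numeric rating plots suggest, that 1 is the riskiest rating and 7 the safest.

```r
# Counts copied from the summary() output above (the single NA is left out)
rating_counts <- c("1" = 1200, "2" = 1183, "3" = 1452, "4" = 1469,
                   "5" = 959, "6" = 849, "7" = 221)

# Share of unverifiable-income listings that fall in each rating
rating_share <- round(prop.table(rating_counts), 3)
print(rating_share)

# Share in the three lowest ratings versus the three highest
low_share  <- sum(rating_counts[c("1", "2", "3")]) / sum(rating_counts)
high_share <- sum(rating_counts[c("5", "6", "7")]) / sum(rating_counts)
print(c(low = low_share, high = high_share))
```

Under that assumption, listings with unverifiable income sit more often in the lower ratings than in the higher ones, although a fair comparison would also need the rating distribution for verifiable incomes.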
[1] "Number of on time Prosper Payments made \n (applies to previous borrowers only)"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$TotalProsperPaymentsBilled and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan 0.00058 - -
Past Due Loan 0.00018 0.01962 -
Unsuccessful Loan 1.5e-09 0.00087 0.90359
P value adjustment method: holm
There are some interesting things to see here about previous borrowers and how diligent they were in paying their prior Prosper loans. The distribution seems to show that borrowers who are currently past due or unsuccessful made fewer payments on their previous loans. Overall, I am not sure whether the distributions are a result of shorter loan terms, paying loans off early, or failing to pay off loans entirely before getting a new loan.
[1] "Prosper Payments Less Than One Month Late \n (applies to previous borrowers only)"
For previous borrowers there is a clear shift in distributions showing that past due and unsuccessful borrowers have more payments that were less than one month late. It is possible that Prosper is able to rate returning borrowers far more accurately as a result of clear differences such as this one.
[1] "Prosper Payments More Than One Month Late \n (applies to previous borrowers only)"
There are very small differences between loan outcomes. It appears that very few previous Prosper users had any such late payments. This could be a result of investors having more knowledge of borrowers who have already had a Prosper loan and being unwilling to invest in a borrower who was more than one month late on a previous Prosper loan.
[1] "Prosper Principal Borrowed by Previous Borrowers \n (applies to previous borrowers only)"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$ProsperPrincipalBorrowed and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan < 2e-16 - -
Past Due Loan 4.1e-07 0.0027 -
Unsuccessful Loan < 2e-16 0.0033 1.0e-05
P value adjustment method: holm
There appears to be a trend whereby higher loan principals are associated with a larger number of successful loans on average.
[1] "Prosper Principal Outstanding (applies to previous borrowers only)"
There are very interesting differences between loan outcomes. Successful loans appear to have much smaller or nonexistent prior principals outstanding. Between past due and unsuccessful loans, past due borrowers have a more bimodal distribution, while unsuccessful loans more closely resemble successful loans but show higher outstanding prior principals on average. I am not entirely sure why this is.
[1] "Current Days Delinquent"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$LoanCurrentDaysDelinquent and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$LoanCurrentDaysDelinquent and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan - - -
Past Due Loan <2e-16 <2e-16 -
Unsuccessful Loan <2e-16 <2e-16 <2e-16
P value adjustment method: holm
While there is a difference between first time and previous borrowers, I am not sure that I want to delve any deeper into this particular variable. It is evident that days delinquent will have a direct relationship with my loan outcome groupings. The interesting trends in this data are the slightly different distributions for first time and previous borrowers.
[1] "Amount of the Loan Origination"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$LoanOriginalAmount and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$LoanOriginalAmount and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan <2e-16 - -
Past Due Loan <2e-16 <2e-16 -
Unsuccessful Loan <2e-16 0.57 <2e-16
P value adjustment method: holm
The most commonly requested loan origination amounts are very similar across groups: roughly $4,000, $10,000 and $16,000 are far and away the most common. The difference between the groups lies in the distributions between these popular amounts. $4,000 loans appear to carry the most risk, with very high percentages of past due and unsuccessful loans; however, they are also the most common loan origination amount.
[1] "Monthly Propser Loan Payment"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$MonthlyLoanPayment and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan <2e-16 - -
Past Due Loan <2e-16 <2e-16 -
Unsuccessful Loan <2e-16 0.099 <2e-16
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$MonthlyLoanPayment and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
Most loan payments hover around $150, likely the result of a popular loan size at a 36 month term ($4,000 origination?). Previous borrowers tend to be more normally distributed in their monthly loan payments, probably the result of having a more normal distribution of loan origination amounts. The differences between the loan outcomes show a trend towards current borrowers having higher monthly payments on what I presume are larger loans. There also appear to be fewer loan failures at the higher monthly loan payments.
[1] "Number of Investors"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$Investors and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan <2e-16 - -
Past Due Loan <2e-16 <2e-16 -
Unsuccessful Loan <2e-16 <2e-16 <2e-16
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$Investors and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
There is a trend towards fewer investors in current loans, whereas successful loans tend to have a more natural, Poisson-like distribution. There also tends to be a higher rate of past due and unsuccessful loans among loans with few investors.
[1] "Credit Range"
The credit range contains some interesting data between the loan outcomes, but the difference between first time and previous borrowers is likely affected by a policy change at Prosper that made a minimum credit score of 640 a requirement for borrowers, where no minimum existed before. This means the previous borrowers’ distribution is not exclusively the result of being a returning borrower; it also reflects a different population of borrowers. As a result I will not directly compare this variable across first time and previous borrowers and will instead look at differences between outcomes, keeping in mind that only a small number of listings fall in the credit ranges below 640.
summary(Prosper_Post09_Subset$credit_range)
600-619 620-639 640-659 660-679 680-699 700-719 720-739 740-759 760-779
1040 1655 8852 14137 14021 13616 11037 7876 5255
780-799 800-819 820-839 840-859 860-879 880-899
3705 2107 1043 398 122 17
sum(Prosper_Post09_Subset$credit_range %in% c("600-619","620-639"))
[1] 2695
[1] "Length of Credit History"
Dunn (1964) Kruskal-Wallis multiple comparison
p-values adjusted with the Holm method.
Comparison Z P.unadj
1 Current Loan - Successful Loan 35.623973 5.961515e-278
2 Current Loan - Past Due Loan 4.616256 3.907251e-06
3 Successful Loan - Past Due Loan -8.233023 1.825468e-16
4 Current Loan - Unsuccessful Loan 20.218324 6.753821e-91
5 Successful Loan - Unsuccessful Loan -1.799136 7.199710e-02
6 Past Due Loan - Unsuccessful Loan 6.488188 8.687482e-11
P.adj
1 3.576909e-277
2 7.814501e-06
3 7.301872e-16
4 3.376911e-90
5 7.199710e-02
6 2.606245e-10
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$credit_history_years and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User 3.8e-14
P value adjustment method: none
There is not an awful lot of visual difference. There might be a slight tendency towards shorter credit history for past due and unsuccessful loans, but it’s hard to pick out visually. The analytical tests do suggest different distributions among most of the loan outcomes (the successful vs. unsuccessful comparison is the exception, with an adjusted p-value of about 0.07) and between the user statuses.
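The `P.adj` column in the Dunn test output above can be checked by hand. The sketch below applies base R's `p.adjust` with the Holm method to the six unadjusted p-values copied from that output; the variable names are my own.

```r
# Unadjusted p-values copied from the Dunn test output (comparisons 1 to 6).
p_unadj <- c(5.961515e-278, 3.907251e-06, 1.825468e-16,
             6.753821e-91, 7.199710e-02, 8.687482e-11)

# Holm: sort ascending, multiply the k-th smallest of m values by (m - k + 1),
# then take a running maximum so adjusted values never decrease.
p_holm <- p.adjust(p_unadj, method = "holm")
print(signif(p_holm, 7))
```

The smallest p-value is multiplied by 6, the next by 5, and so on, which is why the largest one (comparison 5, successful vs. unsuccessful) is left unchanged at about 0.072.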
[1] "Loan Payment to Monthly Income"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$payment_to_monthly_income and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan < 2e-16 - -
Past Due Loan 0.442 < 2e-16 -
Unsuccessful Loan 2.8e-09 < 2e-16 0.019
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$payment_to_monthly_income and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
We see that a larger portion of successful loan outcomes is associated with loan payments constituting a smaller portion of a borrower’s monthly income. It also looks like previous borrowers tend to take loans whose payment to income ratio is lower on average than that of first time borrowers. For the most part monthly loan payments do not exceed ~20% of income for any borrower (with a few exceptions in the unsuccessful loan outcome group).
[1] "Lender Return Rate"
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$lender_return and Prosper_Post09_Subset$Loan_Outcome
Current Loan Successful Loan Past Due Loan
Successful Loan <2e-16 - -
Past Due Loan <2e-16 <2e-16 -
Unsuccessful Loan <2e-16 <2e-16 <2e-16
P value adjustment method: holm
Pairwise comparisons using Wilcoxon rank sum test
data: Prosper_Post09_Subset$lender_return and Prosper_Post09_Subset$Prosper_User_Status
First Time Prosper User
Previous Prosper User <2e-16
P value adjustment method: holm
We can see from looking at all of the combined data that very few loans result in a loss, but when they do (unsuccessful loans) the loss is usually quite high, almost the whole value of an investor’s principal.
Term BorrowerAPR EstimatedReturn ProsperRating
Min. :12.00 Min. :0.04583 Min. :-0.18270 4 :18345
1st Qu.:36.00 1st Qu.:0.16328 1st Qu.: 0.07408 5 :15581
Median :36.00 Median :0.21945 Median : 0.09170 6 :14551
Mean :42.48 Mean :0.22665 Mean : 0.09607 3 :14274
3rd Qu.:60.00 3rd Qu.:0.29254 3rd Qu.: 0.11660 2 : 9795
Max. :60.00 Max. :0.42395 Max. : 0.28370 (Other):12307
NA's :28 NA's : 28
ListingCategory BorrowerState EmploymentDurationMonths
1 :53193 CA :10762 Min. : 0.0
7 : 9225 NY : 5845 1st Qu.: 30.0
2 : 6803 TX : 5637 Median : 74.0
3 : 5301 FL : 5406 Mean :103.1
6 : 2238 IL : 4265 3rd Qu.:148.0
13 : 1996 OH : 3375 Max. :755.0
(Other): 6125 (Other):49591 NA's :19
IsBorrowerHomeowner CurrentCreditLines OpenRevolvingMonthlyPayment
False:40018 Min. : 0.00 Min. : 0.0
True :44863 1st Qu.: 7.00 1st Qu.: 156.0
Median :10.00 Median : 311.0
Mean :10.51 Mean : 430.7
3rd Qu.:13.00 3rd Qu.: 563.0
Max. :59.00 Max. :13765.0
InquiriesLast6Months CurrentDelinquencies AmountDelinquent
Min. : 0.0000 Min. : 0.0000 Min. : 0.0
1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0
Median : 0.0000 Median : 0.0000 Median : 0.0
Mean : 0.9646 Mean : 0.3224 Mean : 950.5
3rd Qu.: 1.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0
Max. :27.0000 Max. :51.0000 Max. :463881.0
RevolvingCreditBalance BankcardUtilization
Min. : 0 Min. :0.0000
1st Qu.: 3823 1st Qu.:0.3300
Median : 9323 Median :0.6000
Mean : 17938 Mean :0.5642
3rd Qu.: 20337 3rd Qu.:0.8300
Max. :999165 Max. :2.5000
TradesNeverDelinquent..percentage. TradesOpenedLast6Months
Min. :0.0800 Min. : 0.0000
1st Qu.:0.8500 1st Qu.: 0.0000
Median :0.9500 Median : 0.0000
Mean :0.9059 Mean : 0.7299
3rd Qu.:1.0000 3rd Qu.: 1.0000
Max. :1.0000 Max. :20.0000
DebtToIncomeRatio IncomeRange IncomeVerifiable
Min. : 0.000 $50,000-74,999:25638 False: 7334
1st Qu.: 0.150 $25,000-49,999:24184 True :77547
Median : 0.220 $100,000+ :15209
Mean : 0.259 $75,000-99,999:14499
3rd Qu.: 0.320 $1-24,999 : 4657
Max. :10.010 (Other) : 45
NA's :7297 NA's : 649
TotalProsperLoans TotalProsperPaymentsBilled
Min. :0.00 Min. : 0.0
1st Qu.:1.00 1st Qu.: 10.0
Median :1.00 Median : 18.0
Mean :1.46 Mean : 24.3
3rd Qu.:2.00 3rd Qu.: 35.0
Max. :8.00 Max. :141.0
NA's :65059 NA's :65059
ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate
Min. : 0.00 Min. : 0.00
1st Qu.: 0.00 1st Qu.: 0.00
Median : 0.00 Median : 0.00
Mean : 0.66 Mean : 0.05
3rd Qu.: 0.00 3rd Qu.: 0.00
Max. :42.00 Max. :21.00
NA's :65059 NA's :65059
ProsperPrincipalBorrowed ProsperPrincipalOutstanding
Min. : 0 Min. : 0
1st Qu.: 3787 1st Qu.: 0
Median : 6400 Median : 1590
Mean : 8753 Mean : 2917
3rd Qu.:11700 3rd Qu.: 4109
Max. :72499 Max. :23451
NA's :65059 NA's :65059
LoanCurrentDaysDelinquent LoanOriginalAmount MonthlyLoanPayment
Min. : 0.00 Min. : 1000 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 4000 1st Qu.: 157.3
Median : 0.00 Median : 7500 Median : 251.8
Mean : 36.65 Mean : 9082 Mean : 291.9
3rd Qu.: 0.00 3rd Qu.:13500 3rd Qu.: 388.3
Max. :1593.00 Max. :35000 Max. :2251.5
Investors Prosper_User_Status
Min. : 1.00 First Time Prosper User:65059
1st Qu.: 1.00 Previous Prosper User :19822
Median : 32.00
Mean : 68.28
3rd Qu.: 97.00
Max. :1189.00
Loan_Outcome credit_range credit_history_years
Current Loan :56576 660-679:14137 Min. : 0.00
Successful Loan :19894 680-699:14021 1st Qu.:12.00
Past Due Loan : 2067 700-719:13616 Median :17.00
Unsuccessful Loan: 6344 720-739:11037 Mean :17.88
640-659: 8852 3rd Qu.:22.00
740-759: 7876 Max. :63.00
(Other):15342
payment_to_monthly_income lender_return
Min. : 0.00 Min. :-1.00092
1st Qu.: 0.03 1st Qu.: 0.02189
Median : 0.05 Median : 0.07779
Mean : 0.26 Mean : 0.08203
3rd Qu.: 0.08 3rd Qu.: 0.18758
Max. :6291.87 Max. : 1.26228
After paring down the original Prosper data set (~114,000 listings with 81 variables) and adding a few variables of my own, I began my analysis with 84,881 loan listings, each containing 38 features. These features are ordinal numbers, dates, factor variables and ratios. The key factor variables that I generated in pursuit of the goal of this analysis are the loan outcome variable (Current, Successful, Past Due and Unsuccessful) and Prosper user status (First Time Borrower and Previous Borrower). I also created the ratio features of lender return and payment to monthly income. In continuing my analysis I will look more narrowly at the summary features, such as Prosper rating, credit range, lender return, estimated borrower APR, estimated lender return, Prosper user status and loan outcome.
I believe that by looking into all of the features in the uni-variate plots above we can form a better understanding of what makes a borrower a safer or riskier prospect without looking at just one number (e.g. Prosper Rating) which may be an accurate measurement of risk but doesn’t explain how the number was derived.
There are a total of 65,059 first time borrowers and 19,822 previous borrowers. Such a return rate possibly reflects Prosper’s loan marketplace working successfully (in the sense that borrowers are coming back and lenders are reinvesting in them). Most loans in this data-set are current (i.e. still on-going), 56,576 to be exact. This means that they are not past due or delinquent in any way and are making payments successfully. These loans will be in different stages of their life (from just beginning all the way to their second to last month), but if we continue looking into the loan outcome variable we can hypothesize that many of these loans will be successful based on the following evidence. Of the 26,238 concluded loans, 19,894 are successful: they have been completed or are on their last payment. There are 6,344 loans which are unsuccessful, meaning these loans have been charged off (written off), cancelled or are delinquent. Finally, in the whole data set there are 2,067 loan listings currently in some stage of past due; these loans are late but not yet failures. At a roughly 3 to 1 ratio of successful to unsuccessful loans (for this snapshot data-set), the number of successful loans is in my opinion too high to declare Prosper’s loan marketplace a failure.
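The arithmetic behind the counts in this paragraph can be reproduced from the `Loan_Outcome` summary. This is a small worked sketch; the vector and variable names are my own, and the counts are copied from the summary output.

```r
# Counts taken from the Loan_Outcome summary above.
outcomes <- c(Current = 56576, Successful = 19894,
              PastDue = 2067, Unsuccessful = 6344)

# Concluded loans are those that either succeeded or failed.
concluded     <- outcomes[["Successful"]] + outcomes[["Unsuccessful"]]
success_ratio <- outcomes[["Successful"]] / outcomes[["Unsuccessful"]]
success_share <- outcomes[["Successful"]] / concluded

cat("Total listings:", sum(outcomes), "\n")
cat("Concluded loans:", concluded, "\n")
cat("Successful loans per unsuccessful loan:", round(success_ratio, 2), "\n")
cat("Share of concluded loans that succeeded:", round(success_share, 3), "\n")
```

Note that the ratio is computed over concluded loans only, so it says nothing yet about how the 56,576 current loans will end.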
It will be interesting to explore further in this analysis whether or not these roughly 6,000 unsuccessful borrowers were rated as high risk or if they received more favorable scores on metrics designed to measure risk (Prosper rating, credit range). If most of these borrowers were given scores indicating very high risk then I would conclude that Prosper is doing a good job of accurately informing its investors of potential risks.
The median return rate for all loans is: ~8%, while the median estimated return rate hovers right around 9%.
The most common credit score range for Prosper borrowers is: 660-679 (not terribly high).
The most common Prosper rating is: 4 (out of 7, 1 being the worst, 7 the best).
The most common loan origination amount is: $4,000.
Most Prosper borrowers have a reported income in the range of: $50,000-$74,999
The most common reason given for the loan request is: Debt Consolidation.
The most common term length for a Prosper loan is: 36 months.
The most common number of investors for a Prosper loan is: 1.
The main features of interest in my data set are the loan outcomes, the lender return (a numeric representation of loan outcomes), credit range and Prosper ratings. Remember that the goal is to determine if Prosper provides a viable platform for individual investors to make loans without so much undue risk that it keeps the average investor from considering the Prosper marketplace as a viable investment option. Ultimately I would like to see if this style of open marketplace for loans can displace institutional loans and whether or not Prosper can adequately bridge the gap of asymmetric information for borrowers and lenders. The achievement of this goal would manifest itself in investors who achieved positive gains on their investments, received higher returns on riskier investments and, where a loan did fail, were adequately warned with high risk scores on Prosper’s risk rating system.
I’ve looked into all of the features that made it to the plotting stage, most of which have some value in regards to whether or not individual investors are able to use them to determine a borrower’s creditworthiness. Moving forward I narrow my focus to a smaller number of variables to be plotted in multi-variate plots as they relate to the Prosper rating systems of risk and reward seen by lenders. Some of these features will be debt to income ratios, estimated returns, number of investors, Prosper user status, Prosper loan payment to income ratio and, of course, lender returns and loan outcome.
I created several new variables. Prosper User Status reflects whether the loan listing is from a previous Prosper borrower or a first time Prosper borrower. Loan Outcome is an aggregation of the loan status variable into just 4 categories: current loans, successful loans, past due loans, and unsuccessful loans. Credit range combines the upper and lower bound credit scores into a single range. Credit history in years was created by subtracting the first reported credit date from the date of the latest credit check and converting the difference into years. The payment to monthly income variable was created to reflect how much of a borrower’s income was spent on payments for their Prosper loan. Finally, I created lender return to reflect the actual interest that an investor sees from their loans to borrowers. This variable is most informative for successful and unsuccessful loans.
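As a rough sketch of how the derived variables described above could be built, the example below works on a four-row toy data frame. The raw column names (`LoanStatus`, `TotalProsperLoans`, `CreditScoreRangeLower`, `CreditScoreRangeUpper`, `FirstRecordedCreditLine`, `DateCreditPulled`, `StatedMonthlyIncome`), the status-to-outcome mapping, and the rule that a missing prior-loan count marks a first time borrower are all assumptions made for illustration, not the exact code used in this analysis. Lender return is left out because its formula is not spelled out here.

```r
# Toy listings; raw column names and values are assumptions for illustration.
raw <- data.frame(
  LoanStatus = c("Current", "Completed", "Chargedoff", "Past Due (1-15 days)"),
  TotalProsperLoans = c(NA, 1, NA, 2),
  CreditScoreRangeLower = c(660, 700, 640, 680),
  CreditScoreRangeUpper = c(679, 719, 659, 699),
  FirstRecordedCreditLine = as.Date(c("1995-06-01", "2001-01-15",
                                      "2005-03-20", "1990-11-02")),
  DateCreditPulled = as.Date(c("2013-06-01", "2012-01-15",
                               "2011-03-20", "2014-01-02")),
  MonthlyLoanPayment = c(150, 320, 90, 410),
  StatedMonthlyIncome = c(4000, 6500, 2500, 5200),
  stringsAsFactors = FALSE
)

# Assumed rule: no recorded prior Prosper loans means a first time borrower.
raw$Prosper_User_Status <- factor(ifelse(is.na(raw$TotalProsperLoans),
                                         "First Time Prosper User",
                                         "Previous Prosper User"))

# Assumed mapping from detailed loan status to the four outcome groups.
status_to_outcome <- function(status) {
  ifelse(status == "Current", "Current Loan",
  ifelse(status %in% c("Completed", "FinalPaymentInProgress"), "Successful Loan",
  ifelse(grepl("^Past Due", status), "Past Due Loan", "Unsuccessful Loan")))
}
raw$Loan_Outcome <- factor(status_to_outcome(raw$LoanStatus),
                           levels = c("Current Loan", "Successful Loan",
                                      "Past Due Loan", "Unsuccessful Loan"))

# Credit range: lower and upper bound pasted into one label.
raw$credit_range <- factor(paste(raw$CreditScoreRangeLower,
                                 raw$CreditScoreRangeUpper, sep = "-"))

# Credit history: days between first credit line and credit pull, in years.
raw$credit_history_years <- as.numeric(
  difftime(raw$DateCreditPulled, raw$FirstRecordedCreditLine,
           units = "days")) / 365.25

# Share of stated monthly income going to the Prosper loan payment.
raw$payment_to_monthly_income <- raw$MonthlyLoanPayment / raw$StatedMonthlyIncome

print(raw[, c("Prosper_User_Status", "Loan_Outcome", "credit_range",
              "credit_history_years", "payment_to_monthly_income")])
```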
I performed several operations on the data. To begin with, I transformed all of the date times to reduce the detail to just the month, day and year. I also factored variables which should be categorical (including most of the ones I created). I filtered my data to remove all loan listings before a certain date (2009). I created a function to change the state abbreviations for the purpose of graphing a map with frequency of borrowers by state. Several of my plots have had their axes transformed by square root functions, and most have been constrained to reflect only feature data up to the \(99^{th}\) quantile in order to weed out potential outliers. All of my plots reflect density distributions, because the data contain large ranges in the scales and different scales between my facet variables. I reordered several of the factor variables so that the axis label orders made more sense in my opinion. I dropped several variables on the basis of their containing mostly ‘NA’ values. I renamed several of my variables in order to shorten the length of feature names. All of the percentages on the x-axis are decimals which I converted only for the plot labels. Additionally, for the payment to monthly income variable, listings that provided no income information were lumped into the category of borrowers whose payments are more than 100% of their income. There were only ten listings above 100% after this operation, therefore whatever error was involved in my decision to impute the data in this fashion is hopefully minor.
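The quantile constraint and the square root transformation described above can be illustrated in base R. This is only a sketch on simulated, right-skewed data; the plots in this analysis are built with ggplot2, so the code below is a stand-in for the idea rather than the plotting code itself.

```r
set.seed(1)
# Simulated right-skewed variable, standing in for a feature with a long tail.
x <- rlnorm(10000, meanlog = 9, sdlog = 1)

# Keep only observations at or below the 99th percentile.
cutoff    <- quantile(x, probs = 0.99, na.rm = TRUE)
x_trimmed <- x[x <= cutoff]

# Density estimate on the square root scale, which compresses the right tail.
dens <- density(sqrt(x_trimmed))

cat("Cutoff (99th percentile):", round(cutoff), "\n")
cat("Rows kept:", length(x_trimmed), "of", length(x), "\n")
```

Trimming at the 99th percentile drops about 1% of rows, so it changes the picture of the tail but not of the bulk of the distribution.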
The following plots represent a narrowing of focus to the variables I believe most helpful in judging the viability of Prosper’s ranking systems. The plots are concerned with variables designed to show whether or not Prosper’s rating system is effective and if it is superior to the standard credit score. In addition I explore several other variables that appear to have a strong relationship with lender return. The goal here is to determine if the Prosper system truthfully identifies risk in borrowers. Ideally this would be seen in low numbers of unsuccessful loans, and those loans that are unsuccessful should all have lower credit scores and lower Prosper ratings (1 = worst or most risky, 7 = best or least risky). The other variables not directly related to credit ranges and Prosper ratings are variables which I felt would offer explanatory power for lender return without having such mysterious origins.
## The pairs plot takes a while to render on my machine and even with
## the limited number of variables does not visually offer much value owing to
## crowding. I will instead explore these variables and others in more depth
## using a larger number of individual plots.
#ggpairs(Prosper_Post09_Subset[,c("lender_return","credit_range",
#"Loan_Outcome","Prosper_User_Status","ProsperRating","EstimatedReturn",
#"BorrowerAPR")],axisLabels = "internal")
## NOTE:
# I found The [R Graphics CookBook](https://rpubs.com/escott8908) an
# invaluable resource when constructing and editing many of my graphs.
# Edgar James Scott II is the author and the resource is on RPubs.
# Several of my layer ideas came from here, in addition I found many
# different ggplot methods, options, and geoms for detailing my plots.
## Background fill technique (annotate) found in some of the plots
# courtesy of sc_evans on
# http://stackoverflow.com/questions/17521438/geom-rect-and-alpha-does-this-work-with-hard-coded-values
Helper Functions - Multi-Variate
ProsperData_concludedLoans <- function(ProsperDataFrame){
  #
  # Input: Prosper Data Frame
  #
  # Output: Prosper Data Frame containing only those rows where the loan outcome
  # is either successful or unsuccessful. In other words, include only those
  # listings where the loan has concluded
  #
  ProsperDataFrame_subset <- subset(ProsperDataFrame,
                  Loan_Outcome %in% c("Successful Loan", "Unsuccessful Loan"),
                  drop = TRUE)
  # remove the unused levels from the factor variable Loan_Outcome
  Prosper_DataFrame_Model_subset <- droplevels(ProsperDataFrame_subset)
  return(Prosper_DataFrame_Model_subset)
}
df <- ProsperData_concludedLoans(Prosper_Post09_Subset)
#View(df)
# Helper Function for Plotting Mosaic plots of Prop.table frequency
# for nested or hierarchical factor variables
## following mosaic plotter thanks to Edwin on Stackoverflow:
# http://stackoverflow.com/questions/19233365/how-to-create-a-marimekko-mosaic-plot-in-ggplot2,
# the vast majority of the code is his; I added relevant labels and
# titles, adjusted some text angles and removed the size legend.
ggMMplot <- function(var1, var2,xlab,ylab,title){
levVar1 <- length(levels(var1))
levVar2 <- length(levels(var2))
jointTable <- prop.table(table(var1, var2))
plotData <- as.data.frame(jointTable)
plotData$marginVar1 <- prop.table(table(var1))
plotData$var2Height <- plotData$Freq / plotData$marginVar1
plotData$var1Center <- c(0, cumsum(plotData$marginVar1)[1:levVar1 -1]) +
plotData$marginVar1 / 2
ggplot(plotData, aes(var1Center, var2Height)) +
geom_bar(stat = "identity", aes(width = marginVar1, fill = var2),
col = "Black") +
scale_fill_hue(ylab) +
geom_text(aes(label = as.character(var1), x = var1Center,
y = 1.05, angle = 45, size=1) )+
scale_x_continuous(breaks=seq(0,1,.05)) +
scale_y_continuous(breaks=seq(0,1,.05)) +
theme(axis.text.x=element_text(angle = 45, hjust = 1),
legend.key.size = unit(0.5,'cm'),
legend.text = element_text(size=rel(.75))) +
xlab(xlab) +
ylab(ylab) +
guides(size= F) +
ggtitle(title)
}
Analytical Correlation Test
## Using the polycor package I ran a heterogeneous correlation analysis
# which computes standard numerical vs. numerical, polyserial and
# polychoric correlations for my ordinal and factor variables.
corr <- hetcor(df[,
c("EstimatedReturn","lender_return","credit_range",
"Loan_Outcome","ProsperRating","Prosper_User_Status",
"DebtToIncomeRatio", "IncomeRange","IsBorrowerHomeowner",
"Investors","BorrowerAPR"
)],use="pairwise.complete.obs", digits = 5, std.err=FALSE)
cat("Correlation Analysis:","\n")
Correlation Analysis:
corr[1]
$correlations
EstimatedReturn lender_return credit_range
EstimatedReturn 1.00000000 -0.052237178 -0.386983387
lender_return -0.05223718 1.000000000 0.006430756
credit_range -0.38698339 0.006430756 1.000000000
Loan_Outcome 0.31513543 -0.946029331 -0.205601646
ProsperRating -0.57200260 0.046885702 0.607537706
Prosper_User_Status -0.11272860 0.083508172 -0.323607090
DebtToIncomeRatio 0.08905845 -0.042861707 -0.031495873
IncomeRange -0.14934948 0.095770526 0.171155945
IsBorrowerHomeowner -0.08370432 0.040796410 0.352298902
Investors -0.19522335 0.045234650 0.384662455
BorrowerAPR 0.71128920 -0.050145569 -0.600140683
Loan_Outcome ProsperRating Prosper_User_Status
EstimatedReturn 0.31513543 -0.5720026 -0.11272860
lender_return -0.94602933 0.0468857 0.08350817
credit_range -0.20560165 0.6075377 -0.32360709
Loan_Outcome 1.00000000 -0.3305395 -0.08951642
ProsperRating -0.33053953 1.0000000 0.04548200
Prosper_User_Status -0.08951642 0.0454820 1.00000000
DebtToIncomeRatio 0.09992526 -0.1333645 0.04185234
IncomeRange -0.21495722 0.2338915 0.02991423
IsBorrowerHomeowner -0.09212040 0.1416776 0.00697690
Investors -0.12298196 0.4490048 -0.03289499
BorrowerAPR 0.35053637 -0.9821437 -0.10317703
DebtToIncomeRatio IncomeRange IsBorrowerHomeowner
EstimatedReturn 0.089058447 -0.14934948 -0.083704322
lender_return -0.042861707 0.09577053 0.040796410
credit_range -0.031495873 0.17115595 0.352298902
Loan_Outcome 0.099925261 -0.21495722 -0.092120396
ProsperRating -0.133364452 0.23389154 0.141677643
Prosper_User_Status 0.041852339 0.02991423 0.006976900
DebtToIncomeRatio 1.000000000 -0.20630815 -0.005005497
IncomeRange -0.206308149 1.00000000 0.405166167
IsBorrowerHomeowner -0.005005497 0.40516617 1.000000000
Investors -0.044315658 0.20862179 0.140656840
BorrowerAPR 0.126989322 -0.21742400 -0.137975903
Investors BorrowerAPR
EstimatedReturn -0.19522335 0.71128920
lender_return 0.04523465 -0.05014557
credit_range 0.38466245 -0.60014068
Loan_Outcome -0.12298196 0.35053637
ProsperRating 0.44900478 -0.98214374
Prosper_User_Status -0.03289499 -0.10317703
DebtToIncomeRatio -0.04431566 0.12698932
IncomeRange 0.20862179 -0.21742400
IsBorrowerHomeowner 0.14065684 -0.13797590
Investors 1.00000000 -0.41515967
BorrowerAPR -0.41515967 1.00000000
[1] "Mosiac plot of Prosper ratings by credit ranges"
This chart has some crowding in the credit range labels but still has some visual value. This mosaic plot charts the proportions of credit ranges and the nested proportions of Prosper ratings. We can see that at the extremes of the credit ranges there are not very many borrowers and that for higher credit ranges we see larger portions of higher Prosper ratings (1 = worst, 7 = best). We would expect to see this given that credit range from the previous chart did a good job of explaining lender return. Given this fact it is not too much of a stretch to believe that the Prosper rating would largely reflect the credit ranges. There is surprisingly a large range of Prosper ratings for all of the credit ranges, even the lowest and highest, suggesting some deviation between the ratings and the credit ranges.
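The quantities a mosaic plot encodes can be computed directly with `prop.table`. The sketch below uses two small invented factors in place of `credit_range` and `ProsperRating`: the marginal proportions give the column widths, and the conditional proportions give the segment heights within each column.

```r
# Toy factors standing in for credit_range (var1) and ProsperRating (var2).
var1 <- factor(c("640-659", "640-659", "640-659", "700-719", "700-719",
                 "700-719", "700-719", "760-779", "760-779", "760-779"))
var2 <- factor(c(2, 3, 3, 4, 4, 5, 3, 6, 7, 6))

joint    <- prop.table(table(var1, var2))  # share of all listings per cell
marginal <- prop.table(table(var1))        # column widths in the mosaic

# Segment heights: each row of the joint table divided by its marginal share.
conditional <- sweep(joint, 1, as.vector(marginal), "/")

print(round(marginal, 2))
print(round(conditional, 2))
```

Because each row of `conditional` sums to 1, every column of the mosaic has the same total height and only its width reflects how common that credit range is.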
[1] "Lender return as a function of Prosper ratings and nest credit ranges"
The plot suggests that each Prosper rating contains a broad range of credit ranges, hinting that more goes into the crafting of Prosper’s rating than just the credit score. That being said, we see a trend of lower credit ranges for lower Prosper ratings and higher ranges for higher ratings. It’s clear that credit ranges are not perfect predictors: for a couple of the Prosper ratings whose I.Q.R.s are strictly in positive returns, the I.Q.R.s of some of the higher credit ranges extend into negative returns while those of lower credit ranges do not. Again, with the exception of a weird fluke (Prosper rating 4, credit range 860-879), all the median values lie in positive return territory. It looks like we see the longest I.Q.R.s for the middle range credit scores. This could be a result of most Prosper borrowers falling into the middle credit range categories.
[1] "Density of Listing be credit range and by Prosper rating for loan outcomes"
For both credit range and Prosper rating we can see that unsuccessful loans have on average higher numbers of loan listings with lower credit ranges and Prosper ratings. It does appear that the Prosper rating does a better job of assigning a low score to those loans that do wind up failing. It’s also good to see a more uniform distribution of listing counts of successful loans across all the ratings; this is a good sign for investors who can tolerate a higher degree of risk (they can achieve higher returns without worrying overly much that they will lose their investment). Given that the unsuccessful loan listing counts are heavily skewed towards lower scores and yet there is still a large portion of low ratings among the successful loans, it could be the case that Prosper is a bit conservative in its rating system.
[1] "Estimated returns mapped against actual returns"
It is interesting to see any loan listings whose estimated return is less than 0% being funded. This plot and the next emphasize just how hard the real world is to predict and model. While not completely off base, the estimated returns are not anywhere near perfectly correlated with the actual returns seen. We can double check this conclusion if we look at the correlation analysis run previously. Returns are both a lot less and a lot more than predicted. Many of the negative actual returns were seen for higher estimated returns; this is good in that it suggests that Prosper loan risk is being rewarded with higher estimated returns.
[1] "Distribution of or lender returns by estimated and actual for loan outcomes"
We can see that estimated returns follow a roughly normal distribution, whereas actual returns look more like a Poisson distribution. Similar to the uni-variate plots for unsuccessful loans, when the loans fail they for the most part fail badly (loss of most if not all of an investor’s capital).
[1] "Boxplot analysis of debt-to-income ratios for first time \n and prior borrowers across loan outcomes"
There is a clear difference between successful and unsuccessful loans whereby unsuccessful loans have higher debt-to-income ratios across all of the Prosper ratings. What is more interesting is the difference between first time and previous borrowers. For the most part previous borrowers have higher debt-to-income ratios.
[1] "Boxplot of Investors by Prosper rating, loan outcome and user status"
The main difference I see here is between first time and prior borrowers. Although not extreme, there is a difference whereby the I.Q.R.s of first time borrowers span a greater range than those of prior borrowers, and first time borrowers tend to have more investors on average. It also looks like successful loans have higher numbers of investors across the board, which is reasonable if we assume most investors dislike risk.
[1] "Status of homeowner vs Loan outcome"
We can see higher Prosper ratings and successful loans for homeowners vs. non-homeowners.
[1] "Lender Return by Credit Range, Prosper Rating and User Status"
For most loan listings, many of the credit ranges for prior borrowers have a slightly higher concentration in positive return territory (seen in the fatter violin shapes). This appears to be the case because the ranges of outcomes are less extreme than for first time borrowers. There are a few outliers even in the highest of credit ranges; however, for the most part we see fewer and fewer failed loans (lender return < 0%) the higher up the credit range we walk (fatter and fatter distributions in the positive territory). For all the credit ranges the majority of listings are in positive lender return territory. Comparing this to Prosper’s proprietary rating system, we see a lot of similarities. Again for the lower ratings 1-3 (high risk) we see distributions that are narrower and span the positive and negative return divide. For Prosper ratings higher than 3 we see mostly positive returns. I believe this reflects (and we can see this in the correlation analysis) that the Prosper rating does a slightly better job of determining when a borrower is going to be successful and when they are likely going to be unsuccessful. The differences between first time and prior borrowers are not as clearly defined for the Prosper ratings as for the credit ranges. It looks like the density distributions are slightly more uniform for all Prosper ratings. This could reflect more accuracy on the part of the Prosper ratings. It appears that investors at Prosper do a good job of selecting those prior borrowers with low Prosper ratings who will actually be successful. We can see that for previous borrowers with a Prosper rating of 1 the fatter part of the distribution falls generously into the territory of positive returns.
[1] "Mosiac plots for Prosper rating and loan outcome and \n credit range and loan outcome"
The previous two plots are mosaic plots; they represent the proportions of two variables, one nested inside the other. In the first plot I see a majority of current borrowers with Prosper ratings of 4 and above. We can see higher failure rates for high risk borrowers (low rating); however, there are a lot of successful loans compared to failures across all of the Prosper ratings. For the credit range we see similar results, simply spread out over more factor levels. It might be the case that there is a more uniform distribution of unsuccessful loans across all of the credit ranges compared to the Prosper rating, but it is rather hard to tell. It is however clear that in terms of predicting the binary classification of successful and unsuccessful loans, Prosper ratings are not significantly superior to credit ranges. Were I putting my own money on the line and only had the option of seeing one metric of risk, I would choose the Prosper score. The least distinction between credit ranges occurs in the most common ranges of credit score; as a result I believe that we can achieve a little more certainty with the Prosper rating. This is a good thing in terms of proving the viability of Prosper’s risk measurements.
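One way to put a number on the failure rates read off the mosaic plots is to compute the share of unsuccessful loans within each rating, using concluded loans only. The sketch below does this on an invented twelve-row data frame; `toy` and its values are hypothetical and only show the calculation.

```r
# Hypothetical concluded loans; ratings and outcomes are invented.
toy <- data.frame(
  ProsperRating = factor(rep(c(1, 4, 7), each = 4)),
  Loan_Outcome  = c("Unsuccessful Loan", "Unsuccessful Loan",
                    "Successful Loan",   "Successful Loan",
                    "Unsuccessful Loan", "Successful Loan",
                    "Successful Loan",   "Successful Loan",
                    "Successful Loan",   "Successful Loan",
                    "Successful Loan",   "Successful Loan")
)

# TRUE marks a failed loan; the mean of a logical is the share of TRUEs.
failed       <- toy$Loan_Outcome == "Unsuccessful Loan"
failure_rate <- tapply(failed, toy$ProsperRating, mean)
print(failure_rate)
```

Running the same calculation grouped by `credit_range` instead of `ProsperRating` would give a like-for-like comparison of how sharply each metric separates failed from successful loans.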
[1] "Scatter plot of investor by lender return and Prosper rating"
We can see that, for the most part, lower-return loans that are also lower risk (higher Prosper rating) have more investors. For those loans that have failed (i.e. negative return) we see a trend of lower Prosper ratings but little difference in the number of investors, with perhaps the only trend being fewer investors on average in unsuccessful loans. Recall that in the univariate section we found that the most common number of investors is actually 1. I would have suspected that higher-risk loans would have a larger number of investors, as this would seem a natural free-market way to protect against loss when you are not able to buy insurance against it. However, it looks like the size of the loan is a much larger factor in determining the number of investors. Not only that, but the loan size may even provide some explanation of Prosper ratings and lender return. The greatest returns are seen by those brave few who invest in high-risk borrowers with smaller loans.
I found clear relationships between Prosper ratings and credit ranges. Furthermore, and fortunately for Prosper investors, there is also a relationship between Prosper ratings and loan outcomes, both in a broad sense (did the loan succeed or fail?) and in a finer-grained way (was the return minimal or large?). There are also relationships between the number of investors and the return seen by those investors. Loan outcomes have some relationship to homeowner status, and throughout all of my variables of interest the difference between first-time and prior borrowers was noticeable. Estimated and actual lender returns are correlated, although not as strongly as I might have guessed. Lower positive returns are associated with higher Prosper ratings (less risk), higher credit ranges (less risk), homeowner status, lower debt-to-income ratios and a larger number of investors per listing. Conversely, higher returns and most negative returns are associated with lower Prosper ratings (high risk), lower credit ranges (although not as clearly as with Prosper ratings), fewer investors and higher debt-to-income ratios.
The strongest relationship found was between loan outcome and lender return, but these two are directly related and therefore not very interesting when looking at correlations. The second strongest relationship is between Prosper rating and estimated return; again, these should be closely tied together as one is supposed to depend on the other. There is a relationship, although not a strong one, between Prosper rating and loan outcome. There is a much stronger relationship between credit range and Prosper rating, which is not too surprising after reviewing the above charts. We can also find a relatively strong relationship between Prosper user status and credit ranges. Several of my features of interest (lender return, Prosper rating and credit range) are only weakly correlated with income range.
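As a rough illustration of how the strength of a relationship between two ordered variables such as Prosper rating and credit range can be quantified, the sketch below computes a Spearman rank correlation with base R. This is a simpler stand-in and not the heterogeneous correlation used in this analysis; the values are toy data invented for the example.

```r
# Toy ordinal data for illustration only; not values from the Prosper data set.
prosper_rating <- c(1, 2, 2, 3, 4, 5, 6, 7)
credit_range   <- c(2, 1, 3, 3, 5, 4, 7, 6)  # ordinal codes for score bands

# Spearman's rho works on ranks, so it only assumes the levels are ordered,
# not that the gaps between them are equal.
rho <- cor(prosper_rating, credit_range, method = "spearman")
print(rho)
```

A value near 1 indicates that the two rankings largely agree; values near 0 indicate little monotonic relationship.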
## The following supervised classification model is only a quick and simple
# prototype to look at the general importance of the different variables.
# A more rigorous and likely superior model could be generated with
# cross-validation and careful tuning of the hyper-parameters
# (i.e. nodesize, mtry and maxnodes). This model is not meant to be a
# concentrated effort at constructing an optimized supervised learning
# model to predict successful and unsuccessful loans; rather, it is
# designed to give a rough idea of the importance of the variables in
# the data set that are used in pursuit of that goal. Furthermore,
# if the goal were to apply a supervised learning model to the Prosper data
# I would likely build a regression model and attempt to predict actual lender
# returns within some small measure of error. In the case of predicting
# actual returns the importance of the variables may change.
## The following random forest classification model was run using Breiman and
# Cutler's excellent Random Forests for Classification and Regression R package.
# A quick review of the package's documentation
# [https://cran.r-project.org/web/packages/randomForest/randomForest.pdf]
# indicates that much of the power of this package remains unused in the
# following model. As explained above, the model's purpose was to
# build a quick prototype and analyze the resulting variable importance, not
# to build the optimal classification model.
classificationModelPrep <- function(ProsperDataFrame) {
  #
  # Input: ProsperDataFrame
  #
  # Output: Prosper data frame with the key, listing date, borrower state and
  # lender return features removed. In addition, the data frame is subsetted
  # on the outcome variable Loan_Outcome so that only loans that have
  # concluded (Successful and Unsuccessful) are kept.
  # Finally, for the purposes of predicting current loans we drop those
  # variables that act as progress reports for the loans, meaning those
  # variables that an investor would not have data for when looking into
  # potential borrowers who have not borrowed before. This means dropping
  # LoanCurrentDaysDelinquent, which is data only acquired after a loan
  # has been given.
  #
  # Use the function argument here rather than the global Prosper_Post09_Subset.
  Prosper_DataFrame_Model <- subset(ProsperDataFrame,
                                    select = -c(ListingKey, ListingDate, BorrowerState,
                                                lender_return, LoanCurrentDaysDelinquent))
  Prosper_DataFrame_Model_subset <- subset(Prosper_DataFrame_Model,
                                           Loan_Outcome %in% c("Successful Loan", "Unsuccessful Loan"),
                                           drop = TRUE)
  # Remove the unused levels from the factor variable Loan_Outcome
  Prosper_DataFrame_Model_subset <- droplevels(Prosper_DataFrame_Model_subset)
  return(Prosper_DataFrame_Model_subset)
}
## Remove remaining NA's from the data frame for random forest classification.
# createDataPartition() comes from caret; tuneRF() comes from randomForest.
Prosper_Data_Frame <- na.omit(classificationModelPrep(Prosper_Post09_Subset))
train_test <- createDataPartition(Prosper_Data_Frame$Loan_Outcome,
                                  times = 1, p = .7, list = FALSE)
# Split data into training and test sets. I will avoid cross-validation in this
# case to keep the run time down on the model training.
train_data <- Prosper_Data_Frame[train_test[, 1], ]
test_data <- Prosper_Data_Frame[-train_test[, 1], ]
train_data_x <- subset(train_data, select = -c(Loan_Outcome))
train_data_y <- train_data$Loan_Outcome
test_data_x <- subset(test_data, select = -c(Loan_Outcome))
test_data_y <- test_data$Loan_Outcome
# Note: `ntree` appears to partially match tuneRF()'s own `ntreeTry` argument,
# so the 750 trees apply to each tuning step; the final forest falls back to
# the randomForest default of 500 trees, as the printed model below shows.
Prosper_train_clf <- tuneRF(x = train_data_x, y = train_data_y,
                            stepFactor = 2, improve = .05,
                            doBest = TRUE, plot = FALSE,
                            xtest = test_data_x,
                            ytest = test_data_y, ntree = 750,
                            importance = TRUE, na.action = na.omit)
## mtry = 5 OOB error = 19.34%
## Searching left ...
## mtry = 3 OOB error = 19.14%
## 0.01001001 0.05
## Searching right ...
## mtry = 10 OOB error = 19.16%
## 0.009009009 0.05
Prosper_train_clf
##
## Call:
## randomForest(x = x, y = y, xtest = ..1, ytest = ..2, mtry = res[which.min(res[, 2]), 1], importance = ..3, na.action = ..4)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 19.24%
## Confusion matrix:
## Successful Loan Unsuccessful Loan class.error
## Successful Loan 4039 84 0.02037351
## Unsuccessful Loan 910 133 0.87248322
## Test set error rate: 18.39%
## Confusion matrix:
## Successful Loan Unsuccessful Loan class.error
## Successful Loan 1743 24 0.01358234
## Unsuccessful Loan 383 63 0.85874439
importance(Prosper_train_clf)
## Successful Loan Unsuccessful Loan
## Term 8.403888 13.01865037
## BorrowerAPR 17.685860 7.78485925
## EstimatedReturn 16.011397 16.60313963
## ProsperRating 9.967724 3.97975092
## ListingCategory 2.843616 -3.98392571
## EmploymentDurationMonths 4.298338 1.81826980
## IsBorrowerHomeowner 6.550472 -2.49792782
## CurrentCreditLines 8.108977 -0.10087558
## OpenRevolvingMonthlyPayment 13.009331 -3.96307022
## InquiriesLast6Months 2.886770 1.39912420
## CurrentDelinquencies 3.707939 10.91200019
## AmountDelinquent 7.003342 10.72480399
## RevolvingCreditBalance 13.904691 -6.02879855
## BankcardUtilization 16.605896 -4.09865366
## TradesNeverDelinquent..percentage. 9.083406 -1.21604421
## TradesOpenedLast6Months 3.136958 -0.02810674
## DebtToIncomeRatio 8.566458 12.08684793
## IncomeRange 8.558326 2.94419679
## IncomeVerifiable 0.000000 0.00000000
## TotalProsperLoans 4.357094 -1.67353901
## TotalProsperPaymentsBilled 9.074343 2.08429430
## ProsperPaymentsLessThanOneMonthLate 8.612507 6.96822374
## ProsperPaymentsOneMonthPlusLate 1.997018 3.94583861
## ProsperPrincipalBorrowed 9.426598 -2.08150738
## ProsperPrincipalOutstanding 8.568429 8.28602051
## LoanOriginalAmount 16.044630 -4.41582220
## MonthlyLoanPayment 16.250981 -5.67672946
## Investors 5.021100 -1.09464745
## Prosper_User_Status 0.000000 0.00000000
## credit_range 21.475231 2.46128044
## credit_history_years 7.840259 -3.46972582
## payment_to_monthly_income 9.822130 6.51634005
## MeanDecreaseAccuracy MeanDecreaseGini
## Term 14.1546883 27.393734
## BorrowerAPR 21.8745533 91.726998
## EstimatedReturn 21.6422616 96.558999
## ProsperRating 12.4394002 46.824669
## ListingCategory 0.5976029 57.377818
## EmploymentDurationMonths 4.6570041 73.557787
## IsBorrowerHomeowner 4.8188947 10.801219
## CurrentCreditLines 7.4387086 58.273946
## OpenRevolvingMonthlyPayment 10.9841897 73.092336
## InquiriesLast6Months 3.3013327 36.143821
## CurrentDelinquencies 8.9561047 26.812186
## AmountDelinquent 11.6767474 33.992487
## RevolvingCreditBalance 11.5327889 69.940326
## BankcardUtilization 14.4269480 68.376214
## TradesNeverDelinquent..percentage. 8.0791046 52.611483
## TradesOpenedLast6Months 2.7426713 30.397039
## DebtToIncomeRatio 14.5851258 86.290572
## IncomeRange 9.8108458 37.829788
## IncomeVerifiable 0.0000000 0.000000
## TotalProsperLoans 3.2204494 16.162151
## TotalProsperPaymentsBilled 9.2836913 67.762209
## ProsperPaymentsLessThanOneMonthLate 10.8785280 30.298676
## ProsperPaymentsOneMonthPlusLate 3.8415179 6.242418
## ProsperPrincipalBorrowed 8.1079129 62.925878
## ProsperPrincipalOutstanding 11.7168266 76.151754
## LoanOriginalAmount 15.9517122 58.090705
## MonthlyLoanPayment 15.1301583 73.602355
## Investors 4.0918352 69.489699
## Prosper_User_Status 0.0000000 0.000000
## credit_range 21.9132838 86.488115
## credit_history_years 5.3275914 61.391330
## payment_to_monthly_income 13.3954722 62.569399
varImpPlot(Prosper_train_clf)
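The error rates printed with the model are easier to interpret once they are recomputed by hand and set against a naive baseline. The sketch below uses only the four test-set counts from the confusion matrix printed above; the variable names are my own. Under this reading, the 18.39% test error is only a little below the roughly 20.15% that always predicting a successful loan would give, and about 86% of the unsuccessful loans in the test set are misclassified.

```r
# Test-set confusion matrix copied from the randomForest output above
# (rows = actual class, columns = predicted class).
test_confusion <- matrix(
  c(1743, 24,
     383, 63),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual    = c("Successful Loan", "Unsuccessful Loan"),
                  predicted = c("Successful Loan", "Unsuccessful Loan"))
)

n_total <- sum(test_confusion)

# Overall error rate: off-diagonal counts over all test loans.
error_rate <- 1 - sum(diag(test_confusion)) / n_total

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(test_confusion) / rowSums(test_confusion)

# Naive baseline: always predict the majority class ("Successful Loan").
baseline_error <- 1 - max(rowSums(test_confusion)) / n_total

print(round(c(model = error_rate, baseline = baseline_error), 4))
print(round(class_error, 4))
```

Because the classes are imbalanced, the per-class error is the more informative figure here; the overall error rate mostly reflects how common successful loans are.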
The initial Prosper data set contained a large quantity of information regarding its users. After deciding to approach this data set as if I were a potential investor investigating the Prosper marketplace, I set to work examining the key variables I felt most relevant to that archetype. Having settled on the general goal of analyzing the efficacy of Prosper’s peer-to-peer marketplace on the basis of whether or not the creditworthiness of its users could be adequately assessed and accounted for, I began munging the data set. After looking over the variables I removed all loan listings from before 2009. Reflecting that 81 variables is not conducive to a brief exploratory analysis, I dropped many variables that either contained information I was not interested in exploring or whose data was part of another variable. Deciding to focus on the outcome of the loans, chiefly whether each was successful or not, and wanting to see the actual rate of return investors received, I created new variables to capture this goal: loan outcome and lender return.
Reducing my variables down to 38, I proceeded to create univariate plots of most of them, with the understanding that I would not be exploring many of them any further. Because I was still interested in understanding how they relate to my groups and goal, I introduced my new group variables to the plots through faceting. I found that faceting shed a lot more light on each variable’s relationship to my goal than simply looking at the collective would have. Many of these plots confirmed that the variable had some relationship to my groups. It quickly became clear, however, that very large visual differences were not going to be seen between any of the variables and my groups. This makes sense: if a few key variables explained all of a borrower’s creditworthiness, there would not be much risk in lending, nor much art in deciding which variables in a borrower’s credit history accurately predict creditworthiness. Even though most of the univariate plots contained variables that were not going to be explored further, I felt it would be valuable to see how these variables changed across my groups, as their origins are easier to understand than a credit score or Prosper rating.
Wrapping up the faceted single-variable plots, I shifted my focus to a smaller set of variables, which were explored in more detail. Running a correlation analysis on the reduced set of features revealed a mixed bag of correlations. Some features, such as debt-to-income ratio and lender return, share almost no correlation, which I found very interesting as there does appear to be a relationship between debt-to-income ratio and Prosper rating. Other features, such as Prosper user status, are related to loan outcomes, suggesting that knowledge of prior Prosper loan performance is a better metric than most first-time borrower metrics. Among my features of interest, the variables most strongly correlated with lender return and loan outcome are Prosper rating, estimated return, Prosper user status and income range. Following the correlation analysis I plotted my features of interest along with several other features that vary noticeably with loan outcome and lender return. For both the credit range and the Prosper rating we see correlation with lender return. We also find in the mosaic plots a clear correlation between the Prosper rating and credit range variables themselves. Continuing, I found a trend whereby lower Prosper ratings and credit ranges are more likely to be associated with failed loans (negative returns), and the variable importances in the random forest classification model show this numerically.
While the model is fairly basic, it suggests that estimated return plays a large part in predicting loan outcomes. The Prosper rating and credit range both play a part in predicting successful loans as well as unsuccessful loans. Ultimately the variables which offer the most information in terms of accurate predictions are estimated return, borrower APR, credit range, payment to monthly income and debt-to-income ratio. It is interesting to note that the debt-to-income ratio is important to the model’s accuracy even though the correlation analysis did not find a strong correlation between loan outcomes and debt-to-income ratio.
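To make the ranking of variables explicit, the importance table can be sorted by each measure. The sketch below rebuilds a hand-picked subset of eight rows from the `importance()` output above (values copied as printed) and orders them by `MeanDecreaseAccuracy` and by `MeanDecreaseGini`. With the fitted model in hand, the same `order()` call could be applied directly to the matrix returned by `importance(Prosper_train_clf)`.

```r
# A subset of rows copied from the importance() output above.
importance_subset <- matrix(
  c(21.9132838, 86.488115,
    21.8745533, 91.726998,
    21.6422616, 96.558999,
    15.9517122, 58.090705,
    15.1301583, 73.602355,
    14.5851258, 86.290572,
    13.3954722, 62.569399,
    12.4394002, 46.824669),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("credit_range", "BorrowerAPR", "EstimatedReturn", "LoanOriginalAmount",
      "MonthlyLoanPayment", "DebtToIncomeRatio", "payment_to_monthly_income",
      "ProsperRating"),
    c("MeanDecreaseAccuracy", "MeanDecreaseGini"))
)

# Rank the variables by each importance measure, largest first.
by_accuracy <- importance_subset[
  order(importance_subset[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
by_gini <- importance_subset[
  order(importance_subset[, "MeanDecreaseGini"], decreasing = TRUE), ]

print(round(by_accuracy, 2))
print(rownames(by_gini))
```

The two measures agree on the top three variables but order the rest differently, so it is worth checking both before drawing conclusions about any single variable.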
The Prosper data set offered a unique look into its borrowers, and the platform itself is certainly unconventional. After briefly exploring the data I feel much more comfortable about investing my own money in this marketplace. As with any lending transaction there certainly exists some risk that you will lose your money; however, loans that fail badly are very often rated as high risk. The real science to achieving success as an investor with Prosper would be picking those loans that are rated as higher risk but still succeed with regularity. Prosper, perhaps to protect its investors and possibly to generate more revenue, appears to be conservative in its rating methodology, at least as it compares to standard credit ranges. Clearly we do not want to see any failed loans, but if we do, we would like to see them carry very high risk ratings. There is no perfect marketplace of loans where getting paid back is a certainty, and in Prosper’s marketplace the individual investors will likely have a harder time weathering lost capital than, say, a large bank. That being said, I think that for those investors willing to accept a little risk the Prosper marketplace provides a viable platform for investing your money, especially in higher rated loans.
Working with so many factor variables constrained some of my plotting options and led me to use some less statistically rigorous techniques to find correlations (pairwise Mann-Whitney tests and heterogeneous correlation tests). Several of my variables were abandoned because they likely contained too many lurking variables. I believe that a more robust analysis of Prosper loan listings could be performed with a longer time frame, and I would enjoy the opportunity to do so. Much of this analysis relies on the belief that the particular quarter this data was drawn from is representative of all Prosper loan listings. This assumption may be a bit of a stretch. I’d also like to see whether I could construct a model that predicts actual lender return as a percentage, and it would be interesting to see whether I could optimize my current classification model to improve its accuracy above 90%. It should be noted that my current model’s importance measurements may be biased (see “Understanding variable importances in forests of randomized trees”). This might explain some of the disparity between my correlation analysis and the variable importance report. I had to do a fair bit of searching for ways to display data with really large ranges in scale, specifically when it included both negative and positive values. I also ran into some difficulty when using a factor variable with more than 12 levels to color plots; apparently there is a limit to what ggplot can automatically handle when coloring on a factor variable. I found it necessary to cut out some data points in a few of the variables; whether they were outliers or real data, they prevented me from seeing the meat of the data when plotted.
The biggest challenge in this analysis may simply have been the size of the data set (the number of variables) and having to cut out many variables and possible relationships to reach even this not insignificant length. The only other point of frustration was the slightly esoteric feature descriptions; not being in the industry, I would have found some domain knowledge helpful. I looked up a lot of the lending lingo for the features, but a better understanding of what is normal in the traditional lending market could have helped in determining just how unique or normal Prosper users are compared to traditional borrowers.
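Two of the plotting obstacles mentioned above have fairly generic workarounds, sketched here in base R. Neither is necessarily what was used for the plots in this analysis: the helper name `signed_log1p`, the sample values and the anchor colours are all my own assumptions. The first part compresses values that span several orders of magnitude on both sides of zero; the second builds a palette with as many colours as a factor has levels, which sidesteps fixed-size palettes (ColorBrewer's qualitative palettes, for example, offer at most 12 colours).

```r
# Signed log transform: keeps the sign, compresses large magnitudes and maps
# 0 to 0, so negative and positive values can share one axis.
signed_log1p <- function(x) sign(x) * log1p(abs(x))

# Made-up values spanning several orders of magnitude on both sides of zero.
returns <- c(-1000, -10, -0.5, 0, 0.5, 10, 1000)
print(round(signed_log1p(returns), 3))

# Interpolate a palette with one colour per factor level.
# The three anchor colours are arbitrary choices.
n_levels <- 20
palette_20 <- grDevices::colorRampPalette(
  c("firebrick", "khaki", "steelblue"))(n_levels)
print(length(palette_20))
```

In ggplot2, a vector like `palette_20` could be supplied through `scale_colour_manual(values = palette_20)`. Interpolated colours become hard to tell apart as the number of levels grows, so merging rare levels may still be the better option.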